[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2019-07-03 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877934#comment-16877934
 ] 

Rinka Singh edited comment on LUCENE-7745 at 7/3/19 3:56 PM:
-

{quote}The basic idea is to compute sub-histograms in each thread block with 
each thread block accumulating into the local memory. Then, when each thread 
block finishes its workload, it atomically adds the result to global memory, 
reducing the overall amount of traffic to global memory. To increase throughput 
and reduce shared memory contention, the main contribution here is that they 
actually use R "replicated" sub-histograms in each thread block, and they 
offset them so that bin 0 of the 1st histogram falls into a different memory 
bank than bin 0 of the 2nd histogram, and so on for R histograms. Essentially, 
it improves throughput in the degenerate case where multiple threads are trying 
to accumulate the same histogram bin at the same time.
{quote}
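Here's a minimal sketch of that replicated shared-memory sub-histogram idea as I understand it (bin count, replica count and names are illustrative assumptions, not the paper's code and not my indexer): each block keeps R copies of the histogram in shared memory, threads spread their atomics across the copies, and the copies are reduced and flushed to global memory once per block.

{code}
// Sketch only: per-block histogram with R replicated shared-memory copies.
// NUM_BINS, R and the launch configuration are illustrative assumptions.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;
constexpr int R        = 8;    // replicated sub-histograms per block

__global__ void histogramKernel(const unsigned char *data, size_t n,
                                unsigned int *globalHist) {
  // R interleaved copies; copy r owns cells [bin * R + r].
  __shared__ unsigned int sh[NUM_BINS * R];
  for (int i = threadIdx.x; i < NUM_BINS * R; i += blockDim.x) sh[i] = 0;
  __syncthreads();

  const int r = threadIdx.x % R;          // which replica this thread updates
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (size_t)blockDim.x * gridDim.x) {
    atomicAdd(&sh[data[i] * R + r], 1u);  // contention spread across replicas
  }
  __syncthreads();

  // Reduce the R replicas and flush one atomicAdd per bin to global memory.
  for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
    unsigned int sum = 0;
    for (int c = 0; c < R; ++c) sum += sh[bin * R + c];
    atomicAdd(&globalHist[bin], sum);
  }
}

int main() {
  const size_t n = 1 << 20;
  unsigned char *d_data;  unsigned int *d_hist;
  cudaMalloc(&d_data, n);
  cudaMalloc(&d_hist, NUM_BINS * sizeof(unsigned int));
  cudaMemset(d_data, 'a', n);                        // dummy input
  cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));
  histogramKernel<<<64, 256>>>(d_data, n, d_hist);
  cudaDeviceSynchronize();
  unsigned int h[NUM_BINS];
  cudaMemcpy(h, d_hist, sizeof(h), cudaMemcpyDeviceToHost);
  printf("count['a'] = %u\n", h['a']);
  cudaFree(d_data); cudaFree(d_hist);
  return 0;
}
{code}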
So here's what I've done/am doing:

I have basic histogramming (including stop-word elimination) working on a single GPU (an old Quadro 2000 with 1 GB of memory). I've tested it on a 5 MB text file and it seems to be working OK.

Briefly, this is how I'm implementing it (a rough sketch of the sort/merge step follows the list).

Read a file from the command line (Linux executable) into the GPU, then:
 * convert the stream to words and chunk them into blocks
 * eliminate the stop words
 * sort/merge (including word-count) everything, first inside a block and then across blocks - I came up with my own sort and haven't had the time to explore the parallel sorts out there
 * this results in a sorted histogram that is held in multiple blocks on the GPU.
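To make the sort/merge counting step concrete, here is a rough sketch of the idea (this is not my actual kernel - it hashes words on the host and uses Thrust's sort and reduce_by_key in place of my custom sort, purely to show the flow):

{code}
// Rough sketch of the "sort, then merge equal words into counts" step using
// Thrust (the real code uses a hand-rolled sort; this is only illustrative).
// Words are pre-hashed to 64-bit keys on the host for simplicity.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

int main() {
  // Host side: tokenize and hash (stop words would be dropped here).
  std::vector<std::string> words = {"lucene", "gpu", "lucene", "index",
                                    "gpu", "lucene"};
  std::vector<uint64_t> keys;
  for (const auto &w : words) keys.push_back(std::hash<std::string>{}(w));

  // Device side: sort the keys, then collapse runs of equal keys into counts.
  thrust::device_vector<uint64_t> d_keys(keys.begin(), keys.end());
  thrust::sort(d_keys.begin(), d_keys.end());

  thrust::device_vector<uint64_t> unique_keys(d_keys.size());
  thrust::device_vector<int> counts(d_keys.size());
  auto ends = thrust::reduce_by_key(d_keys.begin(), d_keys.end(),
                                    thrust::make_constant_iterator(1),
                                    unique_keys.begin(), counts.begin());
  size_t n = ends.first - unique_keys.begin();

  for (size_t i = 0; i < n; ++i)
    printf("key %llu -> count %d\n",
           (unsigned long long)unique_keys[i], (int)counts[i]);
  return 0;
}
{code}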

The advantages of this approach (to my mind) are:
 * I can scale up to use the entire GPU memory. My guess is I can create and manage an 8-10 GB index on a V100 (it has 32 GB) - like I said, I've only tested with a 5 MB text file so far.
 * It is easy to add fresh data into the existing histogram. All I need to do is create new blocks and sort/merge them all.
 * I'm guessing this should make it easy to implement scaling across GPUs, which means on a multi-GPU machine I can scale to almost the number of GPUs there, and then of course one can set up a cluster of such machines... This is far in the future though.
 * The sort is kept separate so we can experiment with various sorts and see 
which one performs best.

 

The issues are:
 * It is currently horrendously slow (I use global memory all the way, with no optimization) - well, OK, much too slow for my liking (I went over to nVidia's office and tested it on a K80 and it was just twice as fast as my GPU). I'm currently trying to implement a shared-memory version (and a few other tweaks) that should speed it up.
 * I have yet to do comparisons with the histogramming tools out there, so I cannot say how much better it is. Once I have the basic inverted index in place, I'll reach out to you all for the testing.
 * It is still a bit fragile - I'm still finding bugs as I test, but the basics work.

 

Currently in progress:
 * The code has been modified for (some) performance. I'm debugging/testing - it will take a while. As of now, I feel good about what I've done, but I won't know till I get it to work and test for performance.
 * I need to add the ability to handle multiple files (I think I will postpone this, as one can always cat the files together and pass them in - that is a pretty simple script that can be wrapped around the executable).
 * I need to create the inverted index.
 * We'll worry about searching on the index later, but that should be pretty trivial - well, actually nothing is trivial here.

 
{quote}Re: efficient histogram implementation in CUDA

If it helps, [this 
approach|https://scholar.google.com/scholar?cluster=4154868272073145366&hl=en&as_sdt=0,3]
 has been good for a balance between GPU performance and ease of implementation 
for work I've done in the past. If academic paywalls block you for all those 
results, it looks to also be available (presumably by the authors) on 
[researchgate|https://www.researchgate.net/publication/256674650_An_optimized_approach_to_histogram_computation_on_GPU]
{quote}
 Took a quick look - they are all paywalled. I will take a look at ResearchGate sometime.

I apologize, but I may not be very responsive in the next month or so, as we are in the middle of a release at work, and this is my night-time job.




[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2019-06-19 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863938#comment-16863938
 ] 

Rinka Singh edited comment on LUCENE-7745 at 6/19/19 10:34 AM:
---

Hi [~mackncheesiest], All,
 A quick update. I've been going really slow (sorry 'bout that). My day job has consumed a lot of my time. Also, what I have working is histogramming (on text files) on GPUs - the problem is that it is horrendously slow, because I use GPU global memory all the way (it is just about 4-5 times faster than a CPU) instead of sorting in local memory. I've been trying to accelerate that before converting it into an inverted index. Nearly there :) - you know how that is, the almost-there syndrome... Once I get it done, I'll check it into my GitHub.

Here are the lessons I learned on my journey:
 # Do all the decision making on the CPU. See if parallelization can substitute for decision making - you need to think about parallelization/optimization as part of the design, not as an after-thought - this is counter to what everyone says about optimization. The reason is that there could be SIGNIFICANT design changes.
 # Use parallelism that is as fine-grained as possible. Don't think one cpu-thread == one gpu-thread; think as parallel as possible (see the sketch after this list).
 # The best metaphor I found (for working with GPUs): think of it as an embedded board attached to your machine - you move data to and from the board and debug on the board. Dump all parallel processing on the board and the sequential work on the CPU.
 # Read the nVidia manuals (they are your best bet). I figured it is better to stay with CUDA (as opposed to OpenCL) given the wealth of CUDA info and support out there...
 # Writing code:
 ## Explicitly think about cache memory (that's your shared memory) and registers (local memory) and manage them. This is COMPLETELY different from writing CPU code, where the compiler does this for you.
 ## Try to use const, shared and register memory as much as possible. Avoid __syncthreads() if you can.
 ## Here's where the dragons lie...
 # Engineer productivity is roughly 1/10th of normal (and I mean writing C/C++, not Python). I've written and thrown away code umpteen times - something that I just wouldn't need to do when writing standard code.
 # Added on 6/19: As a metaphor, think of this as VHDL/Verilog programming (without the timing constructs - timing is not a major issue here since the threads on the device execute in (almost) lockstep).
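A tiny illustration of lesson #2 (made-up names and sizes, not code from my indexer): instead of handing one GPU thread a whole file or line the way you would a CPU thread, hand each thread a single byte and let it decide on its own whether a word starts there.

{code}
// Sketch: one thread per byte; each thread flags positions where a word starts.
// A prefix-scan over the flags (not shown) then yields word offsets.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__device__ bool isDelim(char c) { return c == ' ' || c == '\n' || c == '\t'; }

__global__ void markWordStarts(const char *text, int n, int *starts) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  bool prevIsDelim = (i == 0) || isDelim(text[i - 1]);
  starts[i] = (!isDelim(text[i]) && prevIsDelim) ? 1 : 0;
}

int main() {
  const char h_text[] = "explore gpu acceleration for lucene";
  const int n = (int)strlen(h_text);
  char *d_text; int *d_starts;
  cudaMalloc(&d_text, n);
  cudaMalloc(&d_starts, n * sizeof(int));
  cudaMemcpy(d_text, h_text, n, cudaMemcpyHostToDevice);
  markWordStarts<<<(n + 255) / 256, 256>>>(d_text, n, d_starts);
  int h_starts[64];
  cudaMemcpy(h_starts, d_starts, n * sizeof(int), cudaMemcpyDeviceToHost);
  int words = 0;
  for (int i = 0; i < n; ++i) words += h_starts[i];
  printf("words found: %d\n", words);   // expect 5
  cudaFree(d_text); cudaFree(d_starts);
  return 0;
}
{code}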

Having said all this, :) I've a bunch of limitations that a regular software 
engineer will not have and have been struggling to get over them.  I've been a 
manager for way too long and find it really difficult to focus on just one 
thing (the standard ADHD that most managers eventually develop).  Also, I WAS a 
C programmer, loong ago - no, not even C++ and I just haven't had the bandwidth 
to pick C++ up and then let's not even talk about my day job pressures - I do 
this for an hour or two at night (sigh)...

I will put out everything I've done once I've crossed a milestone (a working 
accelerated histogram).  Then will modify that to do inverted indexing.

Hope this helps...  In the meantime, if you want me to review 
documents/design/thoughts/anything, please feel free to mail them to me at: 
rinka (dot) singh (at) gmail.  At least ping me - I really don't look at 
the Apache messages and would probably miss something...

Sorry, did I mention this - I'm a COMPLETE noob at contributing to open source, so please forgive me if I get the processes and systems wrong... I'm trying to learn.





> Explore GPU acceleration
> 
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Major
>  Labels: gsoc2017, mentor
> Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be speeded up if computations 
> were to be offloaded from CPU to the GPU(s). With commodity GPUs having as 
> high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to 
> speed parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known 
> to be a good candidate for GPU based speedup (esp. when complex polygons are 
> involved). In the past, Mike McCandless has mentioned that "both initial 
> indexing and merging are CPU/IO intensive, but they are very amenable to 
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I 
> volunteer to mentor any GSoC student willing to work on this this summer.






[jira] [Commented] (LUCENE-7745) Explore GPU acceleration

2018-12-16 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722717#comment-16722717
 ] 

Rinka Singh commented on LUCENE-7745:
-

Thank you.
As a first step, let me come up with a GPU-based index builder, and we can look at query handling on the GPU as a second step.
:-) I AM going to be slooow - my apologies - but I'm interested enough to put effort into this and will do my best.

Here's what I'll do as a first step:
Develop a standalone executable (we can figure out modifications to use it directly in Lucene as step 1.1) that will:
a. Read multiple files (command line) and come up with an inverted index
b. Write the inverted index to stdout (I can generate a Lucene index as step 1.1)
c. Handle a stop-words file as a command-line param
d. Work on one GPU + 1 CPU thread (I'll keep multi-GPU and multi-threading to use all CPUs in mind, but implementing that will be a separate step altogether).
A rough host-side skeleton of this flow is sketched below.
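
Here's that skeleton (the GPU pipeline is stubbed out - buildIndexOnGpu is just a placeholder name, nothing here is written yet):

{code}
// Skeleton of the planned standalone executable (step 1).  The GPU pipeline
// itself is stubbed out here - buildIndexOnGpu is a placeholder name.
//   usage: gpu_index <stopwords-file> <input-file> [<input-file> ...]
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

static std::string readFile(const char *path) {
  std::ifstream in(path);
  std::ostringstream ss;
  ss << in.rdbuf();
  return ss.str();
}

// Placeholder for the real CUDA pipeline: copy text to the GPU, tokenize,
// drop stop words, sort/merge into a term -> count histogram, copy back.
static std::map<std::string, long> buildIndexOnGpu(
    const std::vector<std::string> &docs, const std::set<std::string> &stop) {
  std::map<std::string, long> counts;                // CPU stand-in for now
  for (const auto &doc : docs) {
    std::istringstream words(doc);
    std::string w;
    while (words >> w)
      if (!stop.count(w)) ++counts[w];
  }
  return counts;
}

int main(int argc, char **argv) {
  if (argc < 3) {
    std::cerr << "usage: " << argv[0] << " <stopwords> <file> [<file>...]\n";
    return 1;
  }
  std::set<std::string> stop;                        // (c) stop-words file
  {
    std::istringstream in(readFile(argv[1]));
    std::string w;
    while (in >> w) stop.insert(w);
  }
  std::vector<std::string> docs;                     // (a) read input files
  for (int i = 2; i < argc; ++i) docs.push_back(readFile(argv[i]));

  for (const auto &e : buildIndexOnGpu(docs, stop))  // (b) write to stdout
    std::cout << e.first << "\t" << e.second << "\n";
  return 0;
}
{code}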

Goal: Look at speed difference between an Index generated on the CPU vs GPU for 
just this.

We can build from there...
Thoughts please...




[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2018-11-28 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702135#comment-16702135
 ] 

Rinka Singh edited comment on LUCENE-7745 at 11/28/18 4:57 PM:
---

Edited.  Sorry...

A few questions:
* How critical is the inverted index to the user experience?
* What happens if the inverted index is speeded up?
* How many AWS instances would usually be used for searching through a ~140 GB inverted index, and are there any performance numbers around this? (I'd like to compare against a server with 8 GPUs costing about $135-140K - I'm not sure what the equivalent GPU instances on Google Cloud/AWS would cost...)

Assumptions (please validate):
 * Documents are being added to the inverted index; however, the index itself doesn't grow rapidly.
 * The maximum index size will be less than 140 GB - I'm assuming 8 GPUs.








[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2018-11-28 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701990#comment-16701990
 ] 

Rinka Singh edited comment on LUCENE-7745 at 11/28/18 3:08 PM:
---

[~jpountz]
{quote}(Unrelated to your comment Rinka, but seeing activity on this issue 
reminded me that I wanted to share something) There are limited use-cases for 
GPU acceleration in Lucene due to the fact that query processing is full of 
branches, especially since we added support for impacts and WAND.{quote}

While branches do indeed impact performance, well-designed (GPU) code will consist of a combination of CPU code (the decision-making part) and GPU code. For example, I wrote a histogram as a test case that saw SIGNIFICANT acceleration, and I also identified further code areas that can be improved. I'm fairly sure (gut feel) I can squeeze out at least a 40-50x improvement on a mid-sized GPU (given the time, etc.). I think things will be much, much better on a high-end GPU, and with further scale-up on a multi-GPU system...

My point is that thinking (GPU) parallel is a completely different ball game and requires a mind-shift. Once that happens, the value add will be massive, and my gut tells me Lucene is a huge opportunity.

Incidentally, this is why I want to develop a library that I can put out there 
for integration.

{quote}That said Mike initially mentioned that BooleanScorer might be one 
scorer that could benefit from GPU acceleration as it scores large blocks of 
documents at once. I just attached a specialization of a disjunction over term 
queries that should make it easy to experiment with Cuda, see the TODO in the 
end on top of the computeScores method.
{quote}
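As a generic illustration of what "scores large blocks of documents at once" could look like on a GPU (this is not the attached TermDisjunctionQuery code; the data layout and names are assumptions), one thread per document can sum the per-term contributions gathered for that block:

{code}
// Generic sketch: score a block of documents at once by giving one thread to
// each document and summing per-term contributions gathered for that block.
#include <cstdio>
#include <cuda_runtime.h>

// scores[doc] = sum over terms of contrib[term][doc] for the current block.
__global__ void scoreBlock(const float *contrib, int numTerms, int blockSize,
                           float *scores) {
  int doc = blockIdx.x * blockDim.x + threadIdx.x;
  if (doc >= blockSize) return;
  float s = 0.0f;
  for (int t = 0; t < numTerms; ++t)
    s += contrib[t * blockSize + doc];   // contributions laid out term-major
  scores[doc] = s;
}

int main() {
  const int numTerms = 3, blockSize = 1024;
  float *d_contrib, *d_scores;
  cudaMalloc(&d_contrib, numTerms * blockSize * sizeof(float));
  cudaMalloc(&d_scores, blockSize * sizeof(float));
  // Dummy contributions: every term contributes 1.0 to every doc.
  float ones[numTerms * blockSize];
  for (int i = 0; i < numTerms * blockSize; ++i) ones[i] = 1.0f;
  cudaMemcpy(d_contrib, ones, sizeof(ones), cudaMemcpyHostToDevice);
  scoreBlock<<<(blockSize + 255) / 256, 256>>>(d_contrib, numTerms, blockSize,
                                               d_scores);
  float h_scores[blockSize];
  cudaMemcpy(h_scores, d_scores, sizeof(h_scores), cudaMemcpyDeviceToHost);
  printf("score[0] = %.1f (expect %.1f)\n", h_scores[0], (float)numTerms);
  cudaFree(d_contrib); cudaFree(d_scores);
  return 0;
}
{code}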

 Lucene is really new to me (and so is working with Apache - sorry, I am a newbie to Apache) :). Will you please post links here...









[jira] [Commented] (LUCENE-7745) Explore GPU acceleration

2018-11-28 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701976#comment-16701976
 ] 

Rinka Singh commented on LUCENE-7745:
-

> The code is not worth a patch right now, but will soon have something. I 
> shall update on the latest state
> here as soon as I find myself some time (winding down from a hectic Black 
> Friday/Cyber Monday support schedule).

Do you think I could take a look at the code? I could do a quick review and perhaps add a bit of value. I'm fine with the code being in a dev state.

Have you written up something that describes what you are doing?




[jira] [Commented] (LUCENE-7745) Explore GPU acceleration

2018-11-28 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701855#comment-16701855
 ] 

Rinka Singh commented on LUCENE-7745:
-

Hi everyone,

I wanted to check if this issue was still open.  I have been experimenting with 
CUDA for a bit and would love to take a stab at this.

A few thoughts:
 * This is something I'll do over weekends, so I'm going to be horribly slow (it's going to be just me on this unless you have someone working on it whom I can collaborate with) - would that be OK?
 * I think the right thing to do would be to build a CUDA library (C/C++), put a JNI layer on top of it, and then integrate it into Lucene (a rough sketch of such a JNI boundary follows this list). If done right, I think this library will be useful to (and can be integrated with) other analytics tools.
 * If I get it right, then I'd love to create an open-source library that other open-source tools can integrate and use (yes, I'm thinking of an OpenCL port in the future, but given the tools available in CUDA and my familiarity with it...).
 * Licensing is not an issue as I prefer the Apache License.
 * Testing (especially scalability testing) will be an issue - like you said, your setups won't have GPUs, but would it be possible to rent a few GPU instances on the cloud (AWS, Google)? I can do my dev testing locally as I have a GPU on my dev machine (it's a pretty old and obsolete one, but good enough for my needs).
 * It is important to get a few users who will experiment with this.  Can you 
guys help in having someone deploy, experiment and give feedback?
 * I would rather take up something that is used by everyone, and I'm thinking indexing, filtering and searching fit the bill: [http://lucene.apache.org/core/7_5_0/demo/overview-summary.html#overview.description]
 ** These can certainly be accelerated. I think I should be able to get some acceleration out of a GPU-enabled search.
 ** The good part of this is that one would be able to scale volumes almost linearly on a multi-GPU machine.
 ** Related to the previous point (though this is in the future): I don't have a multi-GPU setup and will not be able to develop multi-GPU versions. I'll need help in getting the infrastructure to do that. We can talk about that once a single-GPU version is done.
 ** Yes, I agree that it will be better to have a separate library/classes doing this rather than directly integrating it into Lucene's class library. This suits me too, as I can develop it as a separate library that other open-source components can integrate, and I can package it as part of nVidia's open-source libraries.
 * I'm open to other alternatives - I scanned the ideas above but didn't 
consider them as they would not bring massive value to the users and I don't 
really want to experiment as I know what I'm doing.
 * Related to the previous point: I don't know Lucene (help!! - do I really need to?) and will need support/hand-holding in terms of reviewing the identification/interfacing/design/code, etc.
 * Finally, this IS GOING TO take time, because thinking (and programming) massively parallel is completely different from writing a simple sequential search and sort. How much time? Think 7-10x at least, given all my constraints.
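
To make the JNI point above concrete, here's a minimal sketch of what the boundary could look like (the package, class, method and library names are made up - nothing here exists yet):

{code}
// Sketch of a possible JNI boundary for the CUDA library.  The Java side
// (hypothetical) would be something like:
//
//   package org.example.gpulucene;
//   public final class GpuHistogram {
//     static { System.loadLibrary("gpulucene"); }
//     public static native long countWords(byte[] utf8Text);
//   }
//
// and this is the matching native implementation:
#include <jni.h>
#include <cstdint>

// Placeholder for the real CUDA entry point (tokenize + histogram on device).
static int64_t gpuCountWords(const uint8_t *text, int64_t len) {
  int64_t words = 0;
  bool inWord = false;
  for (int64_t i = 0; i < len; ++i) {          // CPU stand-in for now
    bool delim = text[i] == ' ' || text[i] == '\n' || text[i] == '\t';
    if (!delim && !inWord) ++words;
    inWord = !delim;
  }
  return words;
}

extern "C" JNIEXPORT jlong JNICALL
Java_org_example_gpulucene_GpuHistogram_countWords(JNIEnv *env, jclass,
                                                   jbyteArray utf8Text) {
  jsize len = env->GetArrayLength(utf8Text);
  jbyte *bytes = env->GetByteArrayElements(utf8Text, nullptr);
  jlong result = (jlong)gpuCountWords((const uint8_t *)bytes, (int64_t)len);
  env->ReleaseByteArrayElements(utf8Text, bytes, JNI_ABORT);  // read-only
  return result;
}
{code}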

If you guys like, I can write a brief (one or two paras) description of what is 
possible for indexing, searching, filtering (with zero knowledge of Lucene of 
course) to start off...

Your thoughts please...




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org