[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration
[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877934#comment-16877934 ] Rinka Singh edited comment on LUCENE-7745 at 7/3/19 3:56 PM:
---
{quote}The basic idea is to compute sub-histograms in each thread block, with each thread block accumulating into local memory. Then, when each thread block finishes its workload, it atomically adds its result to global memory, reducing the overall amount of traffic to global memory. To increase throughput and reduce shared-memory contention, the main contribution here is that they actually use R "replicated" sub-histograms in each thread block, offset so that bin 0 of the 1st histogram falls into a different memory bank than bin 0 of the 2nd histogram, and so on for all R histograms. Essentially, this improves throughput in the degenerate case where multiple threads try to accumulate into the same histogram bin at the same time.{quote}
So here is what I have done and am doing. I have basic histogramming (including stop-word elimination) working on a single GPU (an old Quadro 2000 with 1 GB of memory); I have tested it on a 5 MB text file and it seems to work. Briefly, this is how it is implemented:
* Read a file from the command line (a Linux executable) into the GPU.
* Convert the stream to words and chunk them into blocks.
* Eliminate the stop words.
* Sort/merge (including word counts), first inside each block and then across blocks. I came up with my own sort; I have not had time to explore the parallel sorts out there.
* The result is a sorted histogram held in multiple blocks on the GPU.
The advantages of this approach, to my mind, are:
* I can scale up to use the entire GPU memory. My guess is that I can create and manage an 8-10 GB index on a V100 (which has 32 GB); as I said, I have only tested with a 5 MB text file so far.
* It is easy to add fresh data to the existing histogram.
All I need to do is create new blocks and sort/merge them all.
* I am guessing this should make it easy to scale across GPUs, which means that on a multi-GPU machine I can scale to almost the number of GPUs present - and then, of course, one can set up a cluster of such machines. That is far in the future, though.
* The sort is kept separate, so we can experiment with various sorts and see which performs best.
The issues are:
* It is currently horrendously slow (I use global memory all the way, with no optimization) - much too slow for my liking. I went over to nVidia's office and tested it on a K80, and it was only about twice as fast as my GPU. I am currently implementing a shared-memory version (plus a few other tweaks) that should speed it up.
* I have yet to compare it with the histogramming tools out there, so I cannot say how much better it is. Once I have the basic inverted index in place, I will reach out to you all for testing.
* It is still a bit fragile - I keep finding bugs as I test - but the basics work.
Currently in process:
* The code has been modified for (some) performance. I am debugging/testing; it will take a while. As of now I feel good about what I have done, but I will not know until I get it working and test for performance.
* I need to add the ability to handle multiple files. I think I will postpone this, since one can always cat the files together and pass that in - a pretty simple script that can be wrapped around the executable.
* I need to create the inverted index.
* We will worry about searching the index later, but that should be pretty trivial - well, actually, nothing is trivial here.
{quote}Re: efficient histogram implementation in CUDA
If it helps, [this approach|https://scholar.google.com/scholar?cluster=4154868272073145366&hl=en&as_sdt=0,3] has been a good balance between GPU performance and ease of implementation in work I've done in the past.
If academic paywalls block you for all those results, it looks to also be available (presumably posted by the authors) on [researchgate|https://www.researchgate.net/publication/256674650_An_optimized_approach_to_histogram_computation_on_GPU].{quote}
I took a quick look - they are all priced products. I will take a look at the ResearchGate copy sometime. My apologies, but I may not be very responsive over the next month or so, as we are in the middle of a release at work, which competes with my night-time job (this).
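The block-wise pipeline described above (tokenize, drop stop words, build sorted word-count blocks, merge across blocks) can be sketched on the CPU side as follows. This is purely illustrative - the names are mine, not from the actual code, and the real implementation does the per-block sort and the merges in parallel on the GPU:

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// One "block": a sorted run of (word, count) pairs, as held in GPU memory.
using Block = std::vector<std::pair<std::string, int>>;

// Tokenize, drop stop words, and pack words into fixed-size sorted blocks.
std::vector<Block> buildBlocks(const std::string& text,
                               const std::set<std::string>& stopWords,
                               size_t wordsPerBlock) {
    std::vector<Block> blocks;
    std::istringstream in(text);
    std::string w;
    std::map<std::string, int> current;  // kept sorted within the block
    size_t seen = 0;
    while (in >> w) {
        if (stopWords.count(w)) continue;     // stop-word elimination
        ++current[w];
        if (++seen == wordsPerBlock) {        // block is full: emit it
            blocks.emplace_back(current.begin(), current.end());
            current.clear();
            seen = 0;
        }
    }
    if (!current.empty()) blocks.emplace_back(current.begin(), current.end());
    return blocks;
}

// Merge two sorted blocks, summing counts - the cross-block sort/merge step.
// Adding fresh data is then just: build new blocks and merge them in.
Block mergeBlocks(const Block& a, const Block& b) {
    Block out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        int c = a[i].first.compare(b[j].first);
        if (c < 0) out.push_back(a[i++]);
        else if (c > 0) out.push_back(b[j++]);
        else { out.push_back({a[i].first, a[i].second + b[j].second}); ++i; ++j; }
    }
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}
```

Because each block stays sorted, the incremental-update property claimed above falls out for free: new data becomes new blocks, and `mergeBlocks` folds them into the existing histogram.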
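The quoted sub-histogram scheme can be modeled on the CPU, with worker threads standing in for thread blocks and a final atomic pass standing in for the atomic adds to global memory (the R-way replication and bank offsets are the GPU-only part, so they are omitted here). A sketch, with invented names:

```cpp
#include <array>
#include <atomic>
#include <thread>
#include <vector>

constexpr int kBins = 16;

// Global histogram; atomics stand in for CUDA atomicAdd on global memory.
using GlobalHist = std::array<std::atomic<long>, kBins>;

// Each worker (a stand-in for one thread block) accumulates into a private
// sub-histogram, then adds it to the global one exactly once at the end -
// so global-memory traffic is kBins adds per worker, not one per sample.
void histogram(const std::vector<int>& data, int workers, GlobalHist& global) {
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            std::array<long, kBins> local{};           // the "shared memory"
            for (size_t i = w; i < data.size(); i += workers)
                ++local[data[i] % kBins];              // cheap local update
            for (int b = 0; b < kBins; ++b)            // one atomic pass
                global[b].fetch_add(local[b], std::memory_order_relaxed);
        });
    }
    for (auto& t : pool) t.join();
}
```

The point of the paper's replication trick is then to make the `++local[...]` step itself contention-free when many threads within one block hit the same bin - something this CPU model cannot show, since each worker already owns its sub-histogram outright.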
[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration
[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863938#comment-16863938 ] Rinka Singh edited comment on LUCENE-7745 at 6/19/19 10:34 AM:
---
Hi [~mackncheesiest], all, a quick update. I have been going really slow (sorry about that); my day job has consumed a lot of my time. What I have working is histogramming (on text files) on GPUs. The problem is that it is horrendously slow - I use GPU global memory all the way (it is only about 4-5 times faster than a CPU) instead of sorting in local memory. I have been trying to accelerate that before converting it into an inverted index. Nearly there :) - you know how that is, the almost-there syndrome. Once I get it done, I will check it into my GitHub. Here are the lessons I learned on this journey:
# Do all the decision making on the CPU. See if parallelization can substitute for decision making - you need to treat parallelization/optimization as part of the design, not as an afterthought. This is counter to what everyone says about optimization; the reason is that there can be SIGNIFICANT design changes.
# Use as fine-grained parallelism as possible. Do not think one CPU thread == one GPU thread; think as parallel as possible.
# The best metaphor I found for working with GPUs: think of the GPU as an embedded board attached to your machine - you move data to and from the board and debug on the board. Put all parallel processing on the board and all sequential processing on the CPU.
# Read the nVidia manuals (they are your best bet). I figured it is better to stay with CUDA (as against OpenCL) given the wealth of CUDA information and support out there.
# Writing code:
## Explicitly think about cache memory (that is your shared memory) and registers (local memory) and manage them yourself. This is COMPLETELY different from writing CPU code, where the compiler does it for you.
## Try to use const, shared, and register memory as much as possible.
Avoid __syncthreads() if you can.
## Here is where the dragons lie...
# Engineer productivity is roughly 1/10th of normal (and I mean writing C/C++, not Python). I have written and thrown away code umpteen times - something I just would not need to do when writing standard code.
# Added on 6/19: as a metaphor, think of this as VHDL/Verilog programming (without the timing constructs - timing is not a major issue here, since the threads on the device execute in (almost) lockstep).
Having said all this :) I have a bunch of limitations that a regular software engineer will not have, and I have been struggling to get past them. I have been a manager for way too long and find it really difficult to focus on just one thing (the standard ADHD that most managers eventually develop). Also, I WAS a C programmer, long ago - no, not even C++ - and I just have not had the bandwidth to pick up C++; let's not even talk about my day-job pressures. I do this for an hour or two at night (sigh). I will put out everything I have done once I have crossed a milestone (a working accelerated histogram); then I will modify it to do inverted indexing. Hope this helps. In the meantime, if you want me to review documents/designs/thoughts/anything, please feel free to mail them to me at rinka (dot) singh (at) gmail. At least ping me - I really do not look at the Apache messages and would probably miss something. Sorry, did I mention this: I am a COMPLETE noob at contributing to open source, so please forgive me if I get the processes and systems wrong - I am trying to learn.
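Lessons 1 and 2 above (let parallelization substitute for decision making) have a concrete counterpart in GPU code: a data-dependent branch is often rewritten as a predicated, branch-free computation so that threads in a warp stay in lockstep. A trivial CPU-side illustration of that rewrite (my own example, not from the code being discussed):

```cpp
// Branchy version: the kind of per-element decision that causes
// warp divergence when each GPU thread takes a different path.
int clampBranchy(int x, int lo, int hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

// Predicated version: the same decision expressed as arithmetic on
// all-ones/all-zeros masks, so every thread executes the identical
// instruction stream regardless of its data.
int clampPredicated(int x, int lo, int hi) {
    int belowMask = -(x < lo);   // all ones if x < lo, else 0
    int aboveMask = -(x > hi);   // all ones if x > hi, else 0
    int keepMask  = ~(belowMask | aboveMask);
    return (lo & belowMask) | (hi & aboveMask) | (x & keepMask);
}
```

Whether the predicated form wins is exactly the kind of design decision that has to be made up front, not bolted on afterwards.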
[jira] [Commented] (LUCENE-7745) Explore GPU acceleration
[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722717#comment-16722717 ] Rinka Singh commented on LUCENE-7745:
---
Thank you. As a first step, let me come up with a GPU-based index builder; we can look at query handling on the GPU as a second step. :-) I AM going to be slooow - my apologies - but I am interested enough to put effort into this and will do my best. Here is what I will do first: develop a standalone executable (we can figure out modifications to use it directly in Lucene as step 1.1) that will:
a. read multiple files (command line) and build an inverted index;
b. write the inverted index to stdout (I can generate a Lucene index as step 1.1);
c. handle a stop-words file as a command-line parameter;
d. work on one GPU plus one CPU thread (I will keep multi-GPU operation and multi-threading across all CPUs in mind, but implementing that will be a separate step altogether).
Goal: look at the speed difference between an index generated on the CPU vs. the GPU for just this. We can build from there. Thoughts, please...
> Explore GPU acceleration
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ishan Chattopadhyaya
> Assignee: Ishan Chattopadhyaya
> Priority: Major
> Labels: gsoc2017, mentor
> Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
> There are parts of Lucene that can potentially be sped up if computations
> were to be offloaded from the CPU to the GPU(s). With commodity GPUs having as
> much as 12 GB of high-bandwidth RAM, we might be able to leverage GPUs to
> speed up parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known
> to be a good candidate for GPU-based speedup (esp. when complex polygons are
> involved).
> In the past, Mike McCandless has mentioned that "both initial
> indexing and merging are CPU/IO intensive, but they are very amenable to
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I
> volunteer to mentor any GSoC student willing to work on this this summer.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
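Steps (a)-(c) of the proposed executable could look something like the following on the CPU side. This is an illustrative sketch only (names and structure are mine, not the proposed code); the term-to-postings map is the shape the GPU-side histogram blocks would ultimately be reduced into:

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Posting: which document a term occurs in, and how often.
struct Posting { int doc; int count; };

// Build an inverted index over several in-memory "files",
// skipping stop words (item c of the plan above).
std::map<std::string, std::vector<Posting>> invertedIndex(
        const std::vector<std::string>& docs,
        const std::set<std::string>& stopWords) {
    std::map<std::string, std::vector<Posting>> index;
    for (int d = 0; d < (int)docs.size(); ++d) {
        std::map<std::string, int> counts;         // per-doc histogram
        std::istringstream in(docs[d]);
        std::string w;
        while (in >> w)
            if (!stopWords.count(w)) ++counts[w];
        for (const auto& [term, n] : counts)       // append postings
            index[term].push_back({d, n});
    }
    return index;
}
```

Writing the map to stdout, one `term doc:count doc:count ...` line per term, would cover step (b) and make CPU-vs-GPU output easy to diff for the proposed speed comparison.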
[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration
[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702135#comment-16702135 ] Rinka Singh edited comment on LUCENE-7745 at 11/28/18 4:57 PM:
---
Edited. Sorry... A few questions:
* How critical is the inverted index to the user experience?
* What happens if the inverted index is sped up?
* How many AWS instances would usually be used for searching through a ~140 GB inverted index, and are there any performance numbers around this? I would like to compare against a server with 8 GPUs costing about $135-140K; I am not sure what the equivalent GPU instances on Google Cloud/AWS would cost.
Assumptions (please validate):
* Documents are being added to the inverted index, but the index itself does not grow rapidly.
* The maximum index size will be less than 140 GB - hence my assumption of 8 GPUs.
[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration
[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701990#comment-16701990 ] Rinka Singh edited comment on LUCENE-7745 at 11/28/18 3:08 PM:
---
[~jpountz]
{quote}(Unrelated to your comment Rinka, but seeing activity on this issue reminded me that I wanted to share something.) There are limited use-cases for GPU acceleration in Lucene due to the fact that query processing is full of branches, especially since we added support for impacts and WAND.{quote}
While branches certainly do impact performance, well-designed (GPU) code will consist of a combination of CPU code (the decision-making part) and GPU code. For example, I wrote a histogram as a test case that saw SIGNIFICANT acceleration, and I have identified further areas of the code that can be improved. I am fairly sure (gut feel) I can squeeze out at least a 40-50x improvement on a mid-sized GPU, given the time; I think things will be much better still on a high-end GPU, and better again with further scale-up on a multi-GPU system. My point is that thinking (GPU-)parallel is a completely different ball game and requires a mind shift. Once that happens, the value added will be massive, and my gut tells me Lucene is a huge opportunity. Incidentally, this is why I want to develop a library that I can put out there for integration.
{quote}That said, Mike initially mentioned that BooleanScorer might be one scorer that could benefit from GPU acceleration, as it scores large blocks of documents at once. I just attached a specialization of a disjunction over term queries that should make it easy to experiment with CUDA; see the TODO at the end, on top of the computeScores method.{quote}
Lucene is really new to me (and so is working with Apache - sorry, I am a newbie here) :). Please will you put links here...
[jira] [Commented] (LUCENE-7745) Explore GPU acceleration
[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701976#comment-16701976 ] Rinka Singh commented on LUCENE-7745:

> The code is not worth a patch right now, but will soon have something. I
> shall update on the latest state here as soon as I find myself some time
> (winding down from a hectic Black Friday/Cyber Monday support schedule).

Do you think I could take a look at the code? I could do a quick review and perhaps add a bit of value; I'm fine with code that is still in a dev state. Have you written up something that describes what you are doing?
[jira] [Commented] (LUCENE-7745) Explore GPU acceleration
[ https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701855#comment-16701855 ] Rinka Singh commented on LUCENE-7745:

Hi everyone, I wanted to check if this issue is still open. I have been experimenting with CUDA for a while and would love to take a stab at this. A few thoughts:
* This is something I'll do over weekends, so I'm going to be horribly slow (it's going to be just me on this, unless you have someone working on it whom I can collaborate with) - would that be OK?
* I think the right thing to do would be to build a CUDA library (C/C++), put a JNI layer on top, and then integrate it into Lucene. Done right, the library would also be useful to (and possible to integrate with) other analytics tools.
* If I get it right, I'd love to release it as an open-source library that other tools can integrate and use (yes, I'm thinking of an OpenCL port in the future, but given the tooling available in CUDA and my familiarity with it...).
* Licensing is not an issue, as I prefer the Apache License.
* Testing (especially scalability testing) will be an issue - like you said, your setups won't have GPUs, but would it be possible to rent a few GPU instances in the cloud (AWS, Google)? I can do my dev testing locally, as I have a GPU on my dev machine (a pretty old and obsolete one, but good enough for my needs).
* It is important to get a few users who will experiment with this. Can you help find someone to deploy it, experiment, and give feedback?
* I would rather take up something that is used by everyone, so I'm thinking of indexing, filtering, and searching: [http://lucene.apache.org/core/7_5_0/demo/overview-summary.html#overview.description]
** These can certainly be accelerated; I think I should be able to get some acceleration out of a GPU-enabled search.
** The good part is that one would be able to scale volumes almost linearly on a multi-GPU machine.
** Related to the previous point (though this is in the future): I don't have a multi-GPU setup and will not be able to develop multi-GPU versions. I'll need help getting the infrastructure to do that; we can talk about it once a single-GPU version is done.
** Yes, I agree it would be better to have a separate library/classes doing this rather than integrating it directly into Lucene's class library. This suits me too, as I can develop it as a separate library that other open-source components can integrate, and I can package it as part of nVidia's open-source libraries.
* I'm open to other alternatives - I scanned the ideas above but didn't take them up, as they would not bring massive value to users, and I'd rather stay within areas where I know what I'm doing.
* Related to the previous point, I don't know Lucene (help!! - do I really need to?) and will need support/hand-holding in reviewing the identification/interfacing/design/code, etc.
* Finally, this IS GOING TO take time, because thinking (and programming) massively parallel is completely different from writing a simple sequential search and sort. How much time? Think at least 7-10x, given all my constraints.

If you like, I can write a brief (one- or two-paragraph) description of what is possible for indexing, searching, and filtering (with zero knowledge of Lucene, of course) to start off... Your thoughts please...
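The CUDA-library-plus-JNI integration proposed above could expose a boundary along these lines. Every name here (the library, the class, the native method) is hypothetical; no such library exists yet, and this sketch only illustrates the contract a future CUDA library would have to honor, with graceful fallback when the native side is absent:

```java
// Hypothetical JNI boundary for the proposed CUDA histogram/index library.
// All names are illustrative; nothing here is an existing API.
public class GpuIndexLib {
    private static final boolean NATIVE_AVAILABLE = tryLoadNative();

    private static boolean tryLoadNative() {
        try {
            // Hypothetical shared library (e.g. libgpuindex.so built with nvcc).
            System.loadLibrary("gpuindex");
            return true;
        } catch (UnsatisfiedLinkError e) {
            // CUDA library not installed: callers should stay on the CPU path.
            return false;
        }
    }

    // Entry point the CUDA library would export via JNI (hypothetical name
    // and signature): builds a histogram on the GPU and returns a handle.
    private static native long buildHistogram(byte[] utf8Text);

    /** True only when the native CUDA library was found and loaded. */
    public static boolean gpuAvailable() {
        return NATIVE_AVAILABLE;
    }
}
```

Keeping the native code behind a single loadable library like this is what lets the same CUDA code be reused by other analytics tools outside Lucene.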