Many thanks for your time and consideration, Mark!

Everything you describe above is knowledge I built up while tackling the
aftermath of these crashes (when the shared memory becomes mangled). The
current crash occurs while recording antenna data: the hashpipe either created
clean shared memory or attached itself to acceptably neat shared memory, and
came to full operation without any hitches. So I don't think any of it is
directly pertinent (: I appreciate the eager and open communication though!

My understanding is that this error doesn't actually have to do with the
contents of the shared memory but rather with the control of the shared memory
(namely the semaphores that act as access tokens, right?), hence the 'semctl'
and 'semop' errors.
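
As a minimal illustration of where "Invalid argument" can come from, the
self-contained C sketch below calls the System V semop() and semctl() functions
with a semaphore ID that does not refer to a valid set. On Linux both calls
fail with EINVAL, which strerror() renders as "Invalid argument". This is not
hashpipe code: the helper names and the union name are hypothetical, and
whether hashpipe ends up with an invalid ID in this crash is an assumption, not
something the log proves.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Linux expects the caller to define the semctl() argument union. */
union semun_arg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Try to increment semaphore 0 of the given set.
 * Returns 0 on success, otherwise the errno left by semop(). */
int semop_errno(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = IPC_NOWAIT };
    errno = 0;
    return semop(semid, &op, 1) == 0 ? 0 : errno;
}

/* Try to set semaphore 0 of the given set to 1.
 * Returns 0 on success, otherwise the errno left by semctl(). */
int semctl_setval_errno(int semid)
{
    union semun_arg arg = { .val = 1 };
    errno = 0;
    return semctl(semid, 0, SETVAL, arg) == 0 ? 0 : errno;
}
```

Calling either helper with a negative ID, for example semop_errno(-1), stands
in for an ID that has been overwritten with garbage; both then report EINVAL.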

I whittled down the code difference between the unstable and stable threads and
seem to have found that the critical fix entailed reading values from the
databuffers' headers instead of from the hashpipe's status header (which also
sits in shared memory). These values shouldn't actually change during an
observation. Is it possible that backend code that doesn't enclose its shared
memory access (hput*() or hget*()) between hashpipe_status_lock_safe() and
hashpipe_status_unlock_safe() would cause semaphore errors? That seems a
reasonable deduction, since the semaphore error occurred when reading from the
hashpipe_status_buffer (handled primarily by external applications and the
hashpipe backend, which generally share the same underlying code base), whereas
reading from the databuffers (which are handled by fewer lines of code and
accessed almost exclusively by hpguppi_daq) is more stable.
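
For reference, the lock, access, unlock pattern in question can be sketched in
a generic, self-contained form. The C code below is only a stand-in: it guards
a buffer copy with a System V binary semaphore, it does not use hashpipe's own
status-locking calls, and every demo_* name and the union name are
hypothetical.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Linux expects the caller to define the semctl() argument union. */
union semun_arg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private one-semaphore set initialised to 1 (unlocked).
 * Returns the semaphore ID, or -1 on failure. */
int demo_lock_create(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0)
        return -1;
    union semun_arg arg = { .val = 1 };
    if (semctl(semid, 0, SETVAL, arg) < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    return semid;
}

/* Add delta to semaphore 0; a negative delta blocks until it can proceed. */
static int demo_sem_change(int semid, short delta)
{
    struct sembuf op = { .sem_num = 0, .sem_op = delta, .sem_flg = SEM_UNDO };
    return semop(semid, &op, 1);
}

int demo_lock(int semid)   { return demo_sem_change(semid, -1); }
int demo_unlock(int semid) { return demo_sem_change(semid, 1); }

/* Copy len bytes out of the shared buffer while holding the lock.
 * Returns 0 on success, -1 if locking or unlocking failed. */
int demo_read_locked(int semid, const char *shared, char *out, size_t len)
{
    if (demo_lock(semid) < 0)
        return -1;
    memcpy(out, shared, len);
    return demo_unlock(semid);
}
```

A caller would create the lock once with demo_lock_create(), wrap every read
or write of the shared buffer in demo_lock()/demo_unlock(), and remove the set
with semctl(semid, 0, IPC_RMID) when finished.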

I hope that my attempts to keep the discussion as generic (as independent
from my application) as possible haven't made things too vague.

Kind regards,
Ross

On Fri, Dec 18, 2020 at 12:40 AM Mark Ruzindana <ruziem...@gmail.com> wrote:

> Last note, also do it before you set any parameters in shared memory (in
> case you're doing so) otherwise they'll be cleared before you run hashpipe.
>
> Mark Ruzindana
>
> On Thu, Dec 17, 2020 at 3:37 PM Mark Ruzindana <ruziem...@gmail.com>
> wrote:
>
>> Just another note which is a little simpler. I haven't thought about this
>> in a while because I wrote a python interface that runs the
>> hashpipe executable and performs a few other actions for me, but you might
>> also just need to clear the shared memory with the "hashpipe_clean_shmem -I
>> *instance number*" command right before you run hashpipe.
>>
>> What I said before was still relevant, but your particular issue might be
>> this simple.
>>
>> Good luck again!
>>
>> Mark Ruzindana
>>
>> On Thu, Dec 17, 2020 at 11:31 AM Mark Ruzindana <ruziem...@gmail.com>
>> wrote:
>>
>>> Hi Ross,
>>>
>>> I think I have a solution to your problem. It's possible that it's
>>> something else or you might need/want to solve it in a different way, but
>>> this is how I do it.
>>>
>>> The problem is most likely because shared memory segments and semaphore
>>> arrays are being locked and when a change is made to the size of the
>>> semaphore controlled shared memory buffers or something else happens behind
>>> the scenes in hashpipe, the interprocess-communication segments (IPCS)
>>> can't be changed. I was unfamiliar with IPCS before this issue, and I
>>> recommend learning about it if you haven't already.
>>>
>>> So enter 'ipcs' into your terminal and you should see shared memory
>>> segments and semaphore arrays with the exact number of blocks that you
>>> initialized alongside your username. You might also see quite a few other
>>> processes depending on the activity on your server. Once you've spotted
>>> those segments and arrays, you'll need to remove them using their IDs. The
>>> commands to do so are:
>>>
>>> ipcrm -m 'shared memory ID'   -> removes the shared memory segment with that ID.
>>> ipcrm -s 'semaphore ID'       -> removes the semaphore array with that ID.
>>>
>>> I was also able to write a shell script that removes all of the segments
>>> and arrays associated with your account/username by running it once with no
>>> need to look at any IDs which simplifies the process quite a bit (entering
>>> one line in the terminal). But this may not be your issue so I can provide
>>> you with it if you need it. Or you can write your own, whichever works.
>>>
>>> Hopefully this helps. If this isn't your problem, maybe provide a few
>>> more details if you can, and maybe I can help.
>>>
>>> Good luck!
>>>
>>> Mark Ruzindana
>>>
>>> On Wed, Dec 16, 2020 at 11:08 PM Ross Andrew Donnachie <
>>> radonnac...@gmail.com> wrote:
>>>
>>>> Good day all,
>>>>
>>>> Been working on a hashpipe with a pipeline of network, transposition
>>>> and then disk-dump threads. We have 24 data-buffers that we rotate through.
>>>>
>>>> An inconsistent (happens after various amounts of time) crash occurs
>>>> with this printout:
>>>> -----------------------------------------------------
>>>> Tue Dec 15 17:37:19 2020 : Error (hashpipe_databuf_set_filled): semctl
>>>> error [Invalid argument]
>>>> Tue Dec 15 17:37:19 2020 : Error (hashpipe_databuf_wait_free_timeout):
>>>> semop error [Invalid argument]
>>>> semop: Invalid argument
>>>> Tue Dec 15 17:37:19 2020 : Error (hpguppi_atasnap_pktsock_thread):
>>>> error waiting for free databuf [Invalid argument]
>>>> Tue Dec 15 17:37:19 2020 : Error (hashpipe_databuf_set_free): semctl
>>>> error [Invalid argument]
>>>> Tue Dec 15 17:37:19 2020 : Error
>>>> (hashpipe_databuf_wait_filled_timeout): semop error [Invalid argument]
>>>> semop: Invalid argument
>>>> Tue Dec 15 17:37:19 2020 : Error
>>>> (hpguppi_atasnap_pkt_to_FTP_transpose): error waiting for input buffer, rv:
>>>> -2 [Invalid argument]
>>>> -----------------------------------------------------------
>>>>
>>>> If this looks at all familiar to anyone then let's tackle the issue at
>>>> this juncture. If not, then there is a little more to tell!
>>>>
>>>> Other times an error is caught but no full printout from
>>>> hashpipe_error() is made:
>>>>
>>>> Code calls:
>>>> ++++++++++++++++++++++++++++
>>>> hpguppi_databuf_data(struct hpguppi_input_databuf *d, int block_id) {
>>>>     /* Valid block IDs are 0 .. n_block-1, so n_block itself is rejected. */
>>>>     if(block_id < 0 || block_id >= d->header.n_block) {
>>>>         hashpipe_error(__FUNCTION__,
>>>>             "block_id %d out of range [0, %d)", /* %d: block_id is an int */
>>>>             block_id, d->header.n_block);
>>>>         return NULL;
>>>> ....
>>>> ++++++++++++++++++++++++++++
>>>>
>>>> Printout:
>>>> ============
>>>> Tue Dec 15 17:37:19 2020 : Error
>>>> (hpguppi_databuf_data)~/src/hpguppi_daq/src:
>>>> ============
>>>>
>>>> Only once have I seen the above printout complete, showing that
>>>> d->header.n_block = -23135124..., which indicates some deep-rooted rot
>>>> somewhere.
>>>>
>>>> I have made some small changes that have gotten us past the occurrence
>>>> of this issue. I'll be retracing the differences to find the critical fix.
>>>> At that point I'll share.
>>>>
>>>> Kind Regards,
>>>> Ross
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "casper@lists.berkeley.edu" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to casper+unsubscr...@lists.berkeley.edu.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/a/lists.berkeley.edu/d/msgid/casper/bd10cd82-63d1-4d3b-bf4e-78903471df59n%40lists.berkeley.edu
>>>> <https://groups.google.com/a/lists.berkeley.edu/d/msgid/casper/bd10cd82-63d1-4d3b-bf4e-78903471df59n%40lists.berkeley.edu?utm_medium=email&utm_source=footer>
>>>> .
>>>>
