Re: [pu-branch] Deposit submission upload
On Tue, 18 Mar 2014, Lars Holm Nielsen wrote:

> I think both approaches have the same vulnerability? In both cases a recid is generated no matter what - in one case synchronously, in the other case asynchronously.

In my example, the deposition system could generate thousands of bibupload jobs, but record IDs would not be created until the jobs are run; so if the OPS team notices the situation in time and stops the task queue and kills those still-waiting jobs, there would be less damage.

Best regards
--
Tibor Simko
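
The difference Tibor describes can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names, the in-memory queue and the counter are hypothetical stand-ins, not Invenio or bibsched code. It compares how many record IDs are consumed when every submission allocates one immediately with how many are consumed when allocation is deferred to queued jobs that operators can stop.

```python
from collections import deque
from itertools import count


def synchronous_allocation(n_submissions):
    """Allocate a record ID immediately for every submission hit."""
    next_recid = count(1)  # hypothetical stand-in for the recid allocator
    return [next(next_recid) for _ in range(n_submissions)]


def deferred_allocation(n_submissions, jobs_run_before_stop):
    """Queue one upload job per submission; allocate an ID only when a job runs.

    `jobs_run_before_stop` models operators stopping the task queue and
    discarding the jobs that are still waiting.
    """
    next_recid = count(1)
    queue = deque(range(n_submissions))
    allocated = []
    while queue and len(allocated) < jobs_run_before_stop:
        queue.popleft()                      # the job runs ...
        allocated.append(next(next_recid))   # ... and only now gets its ID
    discarded = len(queue)                   # jobs killed while still waiting
    return allocated, discarded
```

In this model a flood of submissions consumes one ID per submission in the synchronous case, while in the deferred case only the jobs that ran before the queue was stopped consume IDs. It says nothing about how quickly a real flood would be noticed.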
Re: [pu-branch] Deposit submission upload
On 17 Mar 2014, at 17:42, Pedro Miguel Paiva Gaudencio pedro.gauden...@cern.ch wrote:

> Hi there,
>
>> Add the recid to the sip.metadata (which create_recid() will do for you). If your jsonalchemy is set up correctly, it should include the 001 tag with the recid in the generated marcxml available in sip.package, and upload_record_sip will run bibupload -r. This you will usually just run after the user hits submit, and thus there's not much difference from waiting until bibupload runs, except making your life easier, by already having the link.
>
> Yeah, probably my jsonalchemy is not doing something, because the recid is not included in the generated marcxml.

Did you install the invenio-demosite?

>> No. First generate recid, stick in sip.metadata, then generate marcxml and stick in sip.package, then bibupload -r. The sip can't and shouldn't be edited after it's been given to bibupload.
>
> Ok, that's done except for including the recid in the marcxml. I tried Tibor's approach, but the thing is that bibupload delays the records... I mean, the recid would be successfully assigned after the record is created (in theory), but since bibupload/bibsched schedules their insertion, they stand waiting on queue and when I query the last record inserted it usually isn't the last submitted one.

Yes, I think you should only use Tibor’s approach if you don’t need the recid afterwards. Querying for the last recid is an unreliable way to obtain it.

You can test it like this:

    from invenio.modules.records.api import Record
    r = Record.create({'recid': 1234}, 'json')
    r.produce('json_for_marc')

This should give you something like this:

    [{'005': '20140318071429.0'}, {'001': 1234}]

Cheers,
Lars

> Anyway, I think I'll take a look at the jsonalchemy stuff to check why the recid isn't being included in the marcxml.
>
> Cheers,
> Pedro
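
The ordering Lars insists on (recid first, then metadata, then marcxml, then bibupload -r) can be sketched in plain Python. Everything here is an assumption made for illustration: the SIP is modelled as a dict, and `create_recid`, `generate_marcxml` and `upload_mode` are toy stand-ins that only borrow names from the thread; they are not the Invenio deposit tasks.

```python
from itertools import count
from xml.sax.saxutils import escape

_recid_counter = count(1)  # hypothetical stand-in for the real recid allocator


def create_recid(sip):
    """Step 1: reserve a record ID and store it in the SIP metadata."""
    sip["metadata"]["recid"] = next(_recid_counter)
    return sip


def generate_marcxml(sip):
    """Step 2: build a tiny MARCXML package with the recid in controlfield 001."""
    meta = sip["metadata"]
    if "recid" not in meta:
        raise ValueError("recid must be set before the package is generated")
    sip["package"] = (
        "<record>"
        f'<controlfield tag="001">{meta["recid"]}</controlfield>'
        '<datafield tag="245" ind1=" " ind2=" ">'
        f'<subfield code="a">{escape(meta["title"])}</subfield>'
        "</datafield>"
        "</record>"
    )
    return sip


def upload_mode(sip):
    """Step 3: replace mode (-r) needs a 001 field; without one only insert (-i) fits."""
    return "-r" if 'tag="001"' in sip["package"] else "-i"
```

The point of the sketch is the dependency: once the package has been generated, the SIP is not touched again, so the recid has to be in the metadata before step 2 runs.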
RE: [pu-branch] Deposit submission upload
> Did you install the invenio-demosite?

Yep, you saw it running.

> Yes, I think you should only use Tibor’s approach if you don’t need the recid afterwards. Querying for the last recid is an unreliable way to obtain it. You can test it like this:
>
>     from invenio.modules.records.api import Record
>     r = Record.create({'recid': 1234}, 'json')
>     r.produce('json_for_marc')
>
> This should give you something like this:
>
>     [{'005': '20140318071429.0'}, {'001': 1234}]

Yeah, so definitely something's wrong cause it returns me an empty list []...

Cheers,
Pedro
Re: [pu-branch] Deposit submission upload
On Mon, 17 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:

> I tried Tibor's approach, but the thing is that bibupload delays the records... I mean, the recid would be successfully assigned after the record is created (in theory), but since bibupload/bibsched schedules their insertion, they stand waiting on queue and when I query the last record inserted it usually isn't the last submitted one.

A post-upload hook would take care of updating the record ID - SIP relationship in this approach. Kind of like bst_send_email.py takes care to email the original submitter of a document only after the record is fully ingested.

Best regards
--
Tibor Simko
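
A post-upload hook of the kind Tibor suggests could look roughly like the following. This is a hedged sketch: `post_upload_hook`, the SIP identifiers and the `links` dict are hypothetical, and how the upload machinery would actually call such a hook is not specified in the thread.

```python
def post_upload_hook(sip_id, recid, links):
    """Record which record ID ingestion assigned to a given SIP.

    Intended to be called once the record is fully ingested, so the
    deposition never has to guess the "last" recid. `links` is a
    hypothetical mapping of SIP identifier -> record ID.
    """
    if sip_id in links and links[sip_id] != recid:
        raise ValueError(
            f"SIP {sip_id!r} is already linked to record {links[sip_id]}"
        )
    links[sip_id] = recid
    return links
```

Because the hook receives both identifiers explicitly, jobs that finish in a different order from the one in which they were submitted are still linked to the right record.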
RE: [pu-branch] Deposit submission upload
I got it working now. I used guess_legacy_field_names() to check which field was the correct one for the '001' rule, and it worked with '_id', so I just added it, based on sip.metadata['recid'], while making the record (and before producing the marcxml).

> A post-upload hook would take care of updating record ID - SIP relationship in this approach. Kind of like bst_send_email.py takes care to email the original submitter of a document only after the record is fully ingested.

I gave preference to the other approach, seemed less complex and with less post work to do.

Thank you for all the help and suggestions,
Pedro
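
Pedro's fix amounts to copying the recid into the field that the '001' producer rule actually reads. A minimal sketch, assuming dict-based metadata: `add_legacy_record_id` and `toy_produce_for_marc` are hypothetical helpers written for this illustration and are not part of JSONAlchemy, and the rule table is invented.

```python
def add_legacy_record_id(metadata, field="_id"):
    """Return a copy of the SIP metadata with the recid copied into `field`."""
    if "recid" not in metadata:
        raise KeyError("metadata has no 'recid'; allocate one first")
    enriched = dict(metadata)
    enriched[field] = metadata["recid"]
    return enriched


def toy_produce_for_marc(record, rules):
    """Toy producer: emit one {tag: value} entry per rule whose field is present."""
    return [{tag: record[field]} for tag, field in rules.items() if field in record]


# Invented rule table: MARC tag -> JSON field the rule reads from.
RULES = {"001": "_id", "245__a": "title"}
```

With these toy rules, a record that only carries 'recid' produces no 001 entry, while the enriched copy does. That mirrors the symptom in the thread without claiming to reproduce its actual cause.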
Re: [pu-branch] Deposit submission upload
On Tue, 18 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:

> I gave preference to the other approach, seemed less complex and with less post work to do.

The major drawback of this approach is vulnerability: if this submission remains open to guests, as most existing INSPIRE submissions are, then watch for script kiddies playing DoS games, or for some automated script going mad, etc.

Best regards
--
Tibor Simko
RE: [pu-branch] Deposit submission upload
I think there is supposed to be a middle step in the workflow, before the upload task, where a human reviews and accepts/rejects the deposition, so this should be fine. Of course, some extra work could be computed to help run the process.

Cheers,
Pedro

From: Tibor Simko
Sent: Tuesday, March 18, 2014 5:33 PM
To: Pedro Miguel Paiva Gaudencio
Cc: Lars Holm Nielsen; Javier Martin Montull; project-invenio-devel (Invenio developers mailing-list)
Subject: Re: [pu-branch] Deposit submission upload

[...]
Re: [pu-branch] Deposit submission upload
On Tue, 18 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:

> I think it's supposed to exist some middle step in the workflow before the upload task where a human reviews and accepts/rejects the deposition, so this should be fine.

If record ID is allocated just before the very last workflow step, not automatically for every submission hit, then indeed it should do.

Best regards
--
Tibor Simko
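
The condition Tibor states, that the record ID is allocated only just before the last step and not on every submission hit, can be sketched as an approval gate. The `Deposition` class and both functions are hypothetical and exist only for this illustration; they do not correspond to the actual deposit workflow tasks.

```python
from itertools import count


class Deposition:
    """Minimal stand-in for a deposition moving through a workflow."""

    def __init__(self, title):
        self.title = title
        self.approved = False
        self.recid = None


def review(deposition, accept):
    """Human review step: accept or reject the deposition."""
    deposition.approved = bool(accept)
    return deposition


def allocate_recid_if_approved(deposition, allocator):
    """Last step before upload: consume a record ID only for approved depositions."""
    if deposition.approved and deposition.recid is None:
        deposition.recid = next(allocator)
    return deposition
```

In this arrangement rejected or abandoned submissions never reach the allocator, so they cannot use up record IDs, whatever their number.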
Re: [pu-branch] Deposit submission upload
On 18 Mar 2014, at 17:33, Tibor Simko tibor.si...@cern.ch wrote:

> On Tue, 18 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:
>
>> I gave preference to the other approach, seemed less complex and with less post work to do.
>
> The major drawback of this approach is vulnerability: if this submission remains open to guests, as most existing INSPIRE submissions are, then watch for script kiddies playing DoS games, or for some automated script going mad, etc.

I think both approaches have the same vulnerability? In both cases a recid is generated no matter what - in one case synchronously, in the other case asynchronously. Probably a captcha would be a good idea for INSPIRE submissions?

Cheers,
Lars

> Best regards
> --
> Tibor Simko
RE: [pu-branch] Deposit submission upload
A captcha is probably the best solution, I'll speak to Javier and let him know.

Cheers,
Pedro

From: Lars Holm Nielsen
Sent: Tuesday, March 18, 2014 9:20 PM
To: Tibor Simko
Cc: Pedro Miguel Paiva Gaudencio; Javier Martin Montull; project-invenio-devel (Invenio developers mailing-list)
Subject: Re: [pu-branch] Deposit submission upload

[...]
Re: [pu-branch] Deposit submission upload
Hi Pedro,

1) Generating recid prior to upload: It all depends on the workflow and what else you need to do. E.g. in Zenodo I need to know the recid prior to uploading, because I use the recid to generate a DOI which goes into the marcxml. Also, knowing the recid before bibupload runs allows me to quickly generate a preview and record link to the soon-to-be-uploaded record, which I can display to the end-user right after they hit submit. In another workflow, it might be fine not to know the recid until after bibupload has run.

2) JSONAlchemy: All the workflows should be moved to invenio-demosite, where you should have the recid (https://github.com/inveniosoftware/invenio-demosite/blob/pu/invenio_demosite/recordext/fields/atlantis.cfg#L698). It's WIP at the moment, and Esteban should soon have some changes coming in for JSONAlchemy. I.e. you should install invenio-demosite on top of Invenio as well, and we should move the workflows out of Invenio to invenio-demosite.

Does that answer your questions?

Cheers,
Lars

On 14.03.2014 17:42, Pedro Miguel Paiva Gaudencio wrote:

> Hi Lars,
>
> I got the deposit submission upload thingy working, just some things left (I think/hope): the marcxml is generated without the 001 (record id - bibupload runs in -r mode in upload_record_sip() and fails because the recid was previously created) and 980 (collection information [article, book, preprint, report, etc] - which hides the record by default) fields.
>
> I understood that the recid is not supposed to be present in the new records' marcxml, but if I don't generate the recid (reserved_recid() and create_recid()) the workflow will fail when it gets to run_tasks(). I also understood (not quite sure if I'm right) that when we upload the new deposition, a marcxml file will be generated from the json that the sip contained. I checked jsonalchemy.get_producer_rules() and it does not contain any rule for the 'recid', and so this is probably why it's not being generated (from the json) along with the rest of the xml (in jsonalchemy.wrappers.legacy_export_as_marc()).
>
> For the upload of new records to work peacefully we need to:
>
> * add the 001 (adding rules for 'recid' in the producer rules?) and 980 fields to the marcxml?
> * add only the 980 field and always upload_record_sip() in -i mode?
>
> Do we need the recid already reserved and created in the sips for the new records before the upload (since when a new record is inserted by bibupload a recid is created for that record)? If so, why?
>
> This is my workflow (note that I'm only uploading new records and never editing existing submissions):
>
> 1. prefill_draft(draft_id='default'),
> 2. render_form(draft_id='default'),
> 3. prepare_sip(),
> 4. reserved_recid(),
> 5. create_recid(),
> 6. process_sip_metadata(process_recjson_new),
> 7. finalize_record_sip(),
> 8. upload_record_sip(),
> 9. run_tasks(update=False)
>
> Sorry about the extensive reading.
>
> Thanks in advance,
> Pedro

--
Lars Holm Nielsen
CERN, IT Department, Collaboration Information Services
http://zenodo.org | Tel: +41 22 76 79182 | Cel: +41 76 672 8927
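
Lars's first point, that knowing the recid before upload lets you derive a DOI and a record link straight away, can be sketched as follows. The DOI prefix, the namespace, the base URL and the URL layout are placeholders chosen for illustration; they are assumptions and not Zenodo's actual configuration or code.

```python
def doi_for_record(recid, prefix="10.1234", namespace="example"):
    """Build a DOI string from a record ID (placeholder prefix and namespace)."""
    if isinstance(recid, bool) or not isinstance(recid, int) or recid <= 0:
        raise ValueError("recid must be a positive integer")
    return f"{prefix}/{namespace}.{recid}"


def preview_url(recid, base_url="https://repository.example.org"):
    """Build a link that could be shown to the user right after submit."""
    return f"{base_url}/record/{recid}"
```

Both values depend only on the recid, which is why they can go into the metadata, and from there into the marcxml, before bibupload has run.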
RE: [pu-branch] Deposit submission upload
Hi Lars,

Yes, perfectly! I got it working. What I'm doing is not creating the recid prior to the upload and letting bibupload -i generate it. The record is peacefully created; the thing I hadn't thought about was how to link the deposition with the record so it can be edited afterwards... Is there a way to get the last recid generated? Perhaps adding the recid to the sip would solve it.

I also tried reserving the recid in the sip prior to the bibupload, but then when the marcxml is uploaded the preview points to the wrong record (the previous one, of course), because bibupload ran with -i since the recid wasn't present in the xml.

So, should I add the recid to the sip after the marcxml is uploaded or create an empty dummy record prior to the bibupload and then just update it in the bibupload?

Cheers,
Pedro

From: Lars Holm Nielsen
Sent: Monday, March 17, 2014 8:40 AM
To: Pedro Miguel Paiva Gaudencio
Cc: Javier Martin Montull; project-invenio-devel (Invenio developers mailing-list)
Subject: Re: [pu-branch] Deposit submission upload

[...]
Re: [pu-branch] Deposit submission upload
On Mon, 17 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:

> So, should I add the recid to the sip after the marcxml is uploaded or create an empty dummy record prior to the bibupload and then just update it in the bibupload?

I'd say the former -- even though this has a philosophical disadvantage in that sip, in its sense of submission information package, would usually contain stuff submitted by the user, which record ID is not.[1]

The latter option, always generating empty placeholder records, has a theoretical disadvantage in case people start many unfinished submissions (e.g. distributed submission attack), which would consume lots of record IDs unnecessarily.

[1] Storing sips as true sips, without any auto-generated information such as record ID, and linking them to records via persistent ID store, would probably be nicer strategy.

Best regards
--
Tibor Simko