Re: [pu-branch] Deposit submission upload

2014-03-19 Thread Tibor Simko
On Tue, 18 Mar 2014, Lars Holm Nielsen wrote:
 I think both approaches have the same vulnerability? In both cases a
 recid is generated no matter what - in one case synchronously, in the
 other case asynchronously.

In my example, the deposition system could generate thousands of
bibupload jobs, but record IDs would not be created until the jobs are
run; so if the OPS team notices the situation in time and stops the task
queue and kills those still-waiting jobs, there would be less damage.
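
To make the difference concrete, here is a minimal sketch in plain Python. It does not touch Invenio: RecidAllocator, submit_eager, submit_deferred, run_jobs and simulate are hypothetical stand-ins for the record ID sequence and the bibsched queue, and the numbers are arbitrary.

```python
from collections import deque
from itertools import count


class RecidAllocator:
    """Hypothetical stand-in for the record ID sequence."""

    def __init__(self):
        self._next = count(1)
        self.allocated = 0

    def new_recid(self):
        self.allocated += 1
        return next(self._next)


def submit_eager(allocator, queue, payload):
    """Reserve the recid at submission time, then queue the upload."""
    recid = allocator.new_recid()
    queue.append((recid, payload))
    return recid


def submit_deferred(queue, payload):
    """Queue the upload only; the recid is assigned when the job runs."""
    queue.append((None, payload))


def run_jobs(allocator, queue, limit):
    """Run at most `limit` queued jobs, allocating recids where missing."""
    done = []
    for _ in range(min(limit, len(queue))):
        recid, payload = queue.popleft()
        if recid is None:
            recid = allocator.new_recid()
        done.append((recid, payload))
    return done


def simulate(eager, submissions=1000, jobs_run_before_stop=10):
    """Flood the queue, run a few jobs, then drop the rest of the queue."""
    allocator, queue = RecidAllocator(), deque()
    for i in range(submissions):
        if eager:
            submit_eager(allocator, queue, i)
        else:
            submit_deferred(queue, i)
    run_jobs(allocator, queue, jobs_run_before_stop)
    queue.clear()  # the operators kill the still-waiting jobs
    return allocator.allocated
```

Under these assumptions, simulate(eager=True) consumes 1000 record IDs while simulate(eager=False) consumes only 10, because the queue was emptied before the remaining jobs could run.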

Best regards
-- 
Tibor Simko


Re: [pu-branch] Deposit submission upload

2014-03-18 Thread Lars Holm Nielsen

On 17 Mar 2014, at 17:42, Pedro Miguel Paiva Gaudencio 
pedro.gauden...@cern.ch wrote:

 Hi there,
 
 Add the recid to the sip.metadata (which create_recid() will do for you). If 
 your jsonalchemy is set up correctly, it should include the 001 tag with the recid 
 in the generated marcxml available in sip.package, and upload_record_sip will 
 run bibupload -r. You will usually run this just after the user hits submit, 
 so there's not much difference from waiting until bibupload runs, 
 except that it makes your life easier by already having the link.
 
 Yeah, probably my jsonalchemy is not doing something, because the recid is 
 not included in the generated marcxml.

Did you install the invenio-demosite?

 
 No. First generate recid, stick in sip.metadata, then generate marcxml and 
 stick in sip.package, then bibupload -r. The sip can't and shouldn't be 
 edited after it's been given to bibupload.
 
 Ok, that's done except for including the recid in the marcxml.
 
 I tried Tibor's approach, but the problem is that bibupload delays the 
 records. I mean, the recid would be successfully assigned after the record 
 is created (in theory), but since bibupload/bibsched schedules their 
 insertion, they wait in the queue, and when I query the last record 
 inserted it usually isn't the last submitted one.

Yes, I think you should only use Tibor’s approach if you don’t need the recid 
afterwards. Querying for the last recid is an unreliable way to obtain it.

You can test it like this:

from invenio.modules.records.api import Record
r = Record.create({'recid': 1234}, 'json')
r.produce('json_for_marc')

This should give you something like this:

[{'005': '20140318071429.0'}, {'001': 1234}]

Cheers,
Lars

 
 Anyway, I think I'll take a look at the jsonalchemy stuff to check why the 
 recid isn't being included in the marcxml.
 
 Cheers,
 Pedro





RE: [pu-branch] Deposit submission upload

2014-03-18 Thread Pedro Miguel Paiva Gaudencio
 Did you install the invenio-demosite?

Yep, you saw it running.

 Yes, I think you should only use Tibor’s approach if you don’t need the recid 
 afterwards. Querying for the last recid is an unreliable way to obtain it.
 
 You can test it like this:
 
 from invenio.modules.records.api import Record
 r = Record.create({'recid': 1234}, 'json')
 r.produce('json_for_marc')
 
 This should give you something like this:
 
 [{'005': '20140318071429.0'}, {'001': 1234}]

Yeah, so something is definitely wrong, because it returns an empty list [] for me...

Cheers,
Pedro


Re: [pu-branch] Deposit submission upload

2014-03-18 Thread Tibor Simko
On Mon, 17 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:
 I tried Tibor's approach, but the problem is that bibupload delays the
 records. I mean, the recid would be successfully assigned after the
 record is created (in theory), but since bibupload/bibsched schedules
 their insertion, they wait in the queue, and when I query the last
 record inserted it usually isn't the last submitted one.

A post-upload hook would take care of updating record ID - SIP
relationship in this approach.  Kind of like bst_send_email.py takes
care to email the original submitter of a document only after the record
is fully ingested.
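
Roughly, the idea could look like the sketch below. It is plain Python with made-up names (SipStore, run_upload, the hook signature); the real bibupload post-processing is not modelled here, only the principle that the link is recorded once the record actually exists.

```python
class SipStore:
    """Hypothetical in-memory store mapping SIP identifiers to record IDs."""

    def __init__(self):
        self._recid_by_sip = {}

    def link(self, sip_id, recid):
        """Remember which record a SIP ended up as."""
        self._recid_by_sip[sip_id] = recid

    def recid_for(self, sip_id):
        # None means the upload has not been ingested yet.
        return self._recid_by_sip.get(sip_id)


def run_upload(sip_id, assign_recid, post_upload_hooks):
    """Pretend to ingest one record, then notify every registered hook."""
    recid = assign_recid()
    for hook in post_upload_hooks:
        hook(sip_id, recid)
    return recid
```

With store = SipStore() and store.link passed as the only hook, store.recid_for(sip_id) stays None until run_upload() has finished for that SIP, so nothing ever needs to guess the "last" record ID.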

Best regards
-- 
Tibor Simko


RE: [pu-branch] Deposit submission upload

2014-03-18 Thread Pedro Miguel Paiva Gaudencio
I got it working now. I used guess_legacy_field_names() to check which 
field is the correct one for the rule '001'; it turned out to be '_id', so I just 
added it from sip.metadata['recid'] while making the record (and before 
producing the marcxml).
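
For anyone hitting the same empty list, the fix amounts to something like this sketch. The producer below is a toy stand-in, not JSONAlchemy: it assumes a single rule mapping tag '001' to the field '_id', as reported above, and add_legacy_recid() and produce_json_for_marc() are hypothetical helper names.

```python
def add_legacy_recid(metadata, field="_id"):
    """Return a copy of the metadata with the recid duplicated under `field`."""
    if "recid" not in metadata:
        raise KeyError("metadata has no 'recid'; create the recid first")
    enriched = dict(metadata)
    enriched[field] = metadata["recid"]
    return enriched


def produce_json_for_marc(record, rules):
    """Toy producer: emit one {tag: value} dict per rule whose field is set."""
    return [{tag: record[field]} for tag, field in rules if field in record]


# Assumed rule table: tag '001' is produced from the '_id' field.
RULES = [("001", "_id")]
```

With these assumptions, a record carrying only 'recid' produces an empty list, while the same record passed through add_legacy_recid() first produces the 001 entry.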

 A post-upload hook would take care of updating record ID - SIP
 relationship in this approach.  Kind of like bst_send_email.py takes
 care to email the original submitter of a document only after the record
 is fully ingested.

I preferred the other approach; it seemed less complex, with less 
post-processing to do.

Thank you for all the help and suggestions,
Pedro


Re: [pu-branch] Deposit submission upload

2014-03-18 Thread Tibor Simko
On Tue, 18 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:
 I gave preference to the other approach, seemed less complex and with
 less post work to do.

The major drawback of this approach is vulnerability: if this submission
remains open to guests, as most existing INSPIRE submissions are, then
watch for script kiddies playing DoS games, or for some automated script
going mad, etc.

Best regards
-- 
Tibor Simko


RE: [pu-branch] Deposit submission upload

2014-03-18 Thread Pedro Miguel Paiva Gaudencio
I think there is supposed to be an intermediate step in the workflow before the 
upload task where a human reviews and accepts/rejects the deposition, so this 
should be fine. Of course, some extra work could be done to help that process 
along.

Cheers,
Pedro



Re: [pu-branch] Deposit submission upload

2014-03-18 Thread Tibor Simko
On Tue, 18 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:
 I think there is supposed to be an intermediate step in the workflow before
 the upload task where a human reviews and accepts/rejects the
 deposition, so this should be fine. 

If record ID is allocated just before the very last workflow step, 
not automatically for every submission hit, then indeed it should do.
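
A minimal sketch of that ordering, assuming a hypothetical process_deposition() step and an approve callback standing in for the human review; none of these names come from Invenio.

```python
from itertools import count


def process_deposition(deposition, approve, recids):
    """Allocate a recid only for depositions that pass the review step.

    `recids` is any iterator yielding fresh record IDs (an assumption
    standing in for the real record ID allocation).
    """
    if not approve(deposition):
        deposition["status"] = "rejected"
        return None
    deposition["recid"] = next(recids)
    deposition["status"] = "accepted"
    return deposition["recid"]


def non_empty_title(deposition):
    """Trivial placeholder for the reviewer's decision."""
    return bool(deposition.get("title"))
```

Rejected depositions never call next(recids), so a flood of junk submissions costs review time but no record IDs.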

Best regards
-- 
Tibor Simko


Re: [pu-branch] Deposit submission upload

2014-03-18 Thread Lars Holm Nielsen
On 18 Mar 2014, at 17:33, Tibor Simko tibor.si...@cern.ch wrote:

 On Tue, 18 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:
 I gave preference to the other approach, seemed less complex and with
 less post work to do.
 
 The major drawback of this approach is vulnerability: if this submission
 remains open to guests, as most existing INSPIRE submissions are, then
 watch for script kiddies playing DoS games, or for some automated script
 going mad, etc.

I think both approaches have the same vulnerability? In both cases a recid is 
generated no matter what - in one case synchronously, in the other case 
asynchronously. Probably a captcha would be a good idea for INSPIRE submissions?

Cheers,
Lars


 
 Best regards
 -- 
 Tibor Simko





RE: [pu-branch] Deposit submission upload

2014-03-18 Thread Pedro Miguel Paiva Gaudencio
A captcha is probably the best solution, I'll speak to Javier and let him know.

Cheers,
Pedro




Re: [pu-branch] Deposit submission upload

2014-03-17 Thread Lars Holm Nielsen

Hi Pedro,

1) Generating recid prior to upload:
It all depends on the workflow and what else you need to do. E.g. in 
Zenodo I need to know the recid prior to uploading, because I use the 
recid to generate a DOI which goes into the marcxml. Also, knowing the 
recid before bibupload runs allows me to quickly generate a preview and 
record link to the soon-to-be-uploaded record, which I can display to the 
end-user right after they hit submit. In another workflow, it might be 
fine not to know the recid until after bibupload has run.


2) JSONAlchemy: All the workflows should be moved to invenio-demosite, 
where you should have the recid 
(https://github.com/inveniosoftware/invenio-demosite/blob/pu/invenio_demosite/recordext/fields/atlantis.cfg#L698). 
It's WIP at the moment, and Esteban should soon have some changes coming 
in for JSONAlchemy.
I.e. you should install invenio-demosite on top of Invenio as well, and 
we should move the workflows out of Invenio to invenio-demosite.
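
As an illustration of point 1, a DOI could be derived from a pre-allocated recid along these lines. The helper names are hypothetical, and the prefix/namespace defaults are only examples of the general prefix/suffix shape of a DOI, not a statement of how Zenodo's code does it.

```python
def make_doi(recid, prefix="10.5281", namespace="zenodo"):
    """Build a DOI string from a record ID (defaults are illustrative)."""
    if not isinstance(recid, int) or recid <= 0:
        raise ValueError("recid must be a positive integer")
    return "{0}/{1}.{2}".format(prefix, namespace, recid)


def prepare_metadata(metadata, recid):
    """Return a copy of the metadata carrying both the recid and its DOI."""
    enriched = dict(metadata)
    enriched["recid"] = recid
    enriched["doi"] = make_doi(recid)
    return enriched
```

Because the DOI depends on the recid, the recid has to exist before the marcxml is generated, which is the constraint described in point 1.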


Does that answer your questions?

Cheers,
Lars
On 14.03.2014 17:42, Pedro Miguel Paiva Gaudencio wrote:

Hi Lars,

I got the deposit submission upload working, with just a few things 
left (I think/hope): the marcxml is generated without the 001 (record 
id - bibupload runs in -r mode in upload_record_sip() and fails 
because the recid was previously created) and 980 (collection 
information [article, book, preprint, report, etc.] - which 
hides the record by default) fields.


I understood that the recid is not supposed to be present in the new 
records' marcxml, but if I don't generate the recid 
(reserved_recid() and create_recid()) the workflow will fail when it 
gets to run_tasks().


I also understood (not quite sure if I'm right) that when we upload 
the new deposition, a marcxml file will be generated from the json 
that the sip contains.


I checked jsonalchemy.get_producer_rules() and it does not contain 
any rule for the 'recid', so this is probably why it's not being 
generated (from the json) along with the rest of the xml (in 
jsonalchemy.wrappers.legacy_export_as_marc()).


For the upload of new records to work peacefully we need to:

  * add the 001 (adding rules for 'recid' in the producer rules?) and
980 fields to the marcxml?
  * add only the 980 field and always upload_record_sip() in -i mode?


Do we need the recid already reserved and created in the sips for the 
new records before the upload (since when a new record is inserted by 
bibupload a recid is created for that record)? If so, why?



This is my workflow (note that I'm only uploading new records and 
never editing existing submissions):


 1. prefill_draft(draft_id='default'),
 2. render_form(draft_id='default'),
 3. prepare_sip(),
 4. reserved_recid(),
 5. create_recid(),
 6. process_sip_metadata(process_recjson_new),
 7. finalize_record_sip(),
 8. upload_record_sip(),
 9. run_tasks(update=False)


Sorry about the extensive reading.

Thanks in advance,
Pedro



--
Lars Holm Nielsen
CERN, IT Department, Collaboration & Information Services
http://zenodo.org | Tel: +41 22 76 79182 | Cel: +41 76 672 8927





RE: [pu-branch] Deposit submission upload

2014-03-17 Thread Pedro Miguel Paiva Gaudencio
Hi Lars,

Yes, perfectly! I got it working. What I'm doing is not creating the recid 
prior to the upload and letting bibupload -i generate it. The record is 
created without problems; what I hadn't thought about was how to link the 
deposition with the record so that it can be edited afterwards. Is there a way 
to get the last recid generated? Perhaps adding the recid to the sip would solve it.

I also tried reserving the recid in the sip prior to the bibupload, but then 
when the marcxml is uploaded the preview points to the wrong record (the 
previous one, of course), because bibupload ran with -i since the recid wasn't 
present in the xml.

So, should I add the recid to the sip after the marcxml is uploaded or create 
an empty dummy record prior to the bibupload and then just update it in the 
bibupload?

Cheers,
Pedro





Re: [pu-branch] Deposit submission upload

2014-03-17 Thread Tibor Simko
On Mon, 17 Mar 2014, Pedro Miguel Paiva Gaudencio wrote:
 So, should I add the recid to the sip after the marcxml is uploaded or
 create an empty dummy record prior to the bibupload and then just
 update it in the bibupload?

I'd say the former -- even though this has a philosophical disadvantage
in that sip, in its sense of submission information package, would
usually contain stuff submitted by the user, which record ID is not.[1]

The latter option, always generating empty placeholder records, has a
theoretical disadvantage in case people start many unfinished
submissions (e.g. distributed submission attack), which would consume
lots of record IDs unnecessarily.

[1] Storing sips as true sips, without any auto-generated
information such as record ID, and linking them to records via a
persistent ID store, would probably be a nicer strategy.

Best regards
-- 
Tibor Simko