Re: [MarkLogic Dev General] mlcp ability to skip corrupt zip files?

2015-07-22 Thread Justin Makeig
Can you please contact MarkLogic Support <supp...@marklogic.com> on this?
There might be a bug here.

I'll be sure to update this list with what we find.

Thanks.

Justin

___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general


Re: [MarkLogic Dev General] mlcp ability to skip corrupt zip files?

2015-07-22 Thread Jason Hunter
MLCP comes with source. Should be a small edit to catch the exception and keep 
going. Maybe enabled with a flag. 

Sent from my iPhone



Re: [MarkLogic Dev General] mlcp ability to skip corrupt zip files?

2015-07-22 Thread Geert Josten
Hi Kristina,

I would have expected MLCP to skip corrupt files without crashing, but 
apparently it does not. It is not perfect, but a workaround could be to wrap 
MLCP in another script that loops over the zip files itself and makes a new 
MLCP call for each zip. Parallelization is more difficult that way (i.e. it is 
likely slower), but at least it allows you to finish processing completely.
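
The wrapper idea above could look roughly like the following. This is only a 
minimal sketch, not a tested ingestion script: the function names are 
hypothetical, it assumes the launcher is on the PATH as mlcp.sh, and it assumes 
the remaining options (connection details, transform, collections and so on) 
are passed through unchanged in extra_options. It also checks each archive with 
Python's zipfile module before calling mlcp, which is an addition to the 
suggestion rather than part of it.

```python
import os
import subprocess
import zipfile
import zlib


def find_zip_files(root):
    """Yield every .zip path under root, walking nested directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if name.lower().endswith(".zip"):
                yield os.path.join(dirpath, name)


def is_readable_zip(path):
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
        return False


def build_mlcp_command(zip_path, extra_options, mlcp="mlcp.sh"):
    """Build one mlcp import call for a single zip file.

    The launcher name and the pass-through of extra_options are assumptions.
    """
    return [mlcp, "import", "-input_file_path", zip_path,
            "-mode", "local", "-input_compressed", "true", *extra_options]


def ingest_all(root, extra_options, run=subprocess.run):
    """Run mlcp once per zip and return (loaded, skipped) lists of paths.

    A zip is skipped if it fails the local check or if mlcp exits non-zero,
    so one bad archive no longer stops the whole run.
    """
    loaded, skipped = [], []
    for zip_path in find_zip_files(root):
        if not is_readable_zip(zip_path):
            skipped.append(zip_path)
            continue
        result = run(build_mlcp_command(zip_path, extra_options))
        (loaded if result.returncode == 0 else skipped).append(zip_path)
    return loaded, skipped
```

Calling ingest_all(inputFilePath, [...]) with the rest of your options in the 
list would then return the zips that loaded and the ones that were skipped, so 
the skipped list can be reviewed afterwards. Starting a new JVM per zip is what 
makes this slower than a single MLCP run.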

Can you send me a small example of such a corrupt zip file off-list? I could 
use that to file a bug against MLCP internally.

Cheers,
Geert



[MarkLogic Dev General] mlcp ability to skip corrupt zip files?

2015-07-21 Thread Morales-Martin, Kristina

Dear all,

We are using the MarkLogic Content Pump (mlcp) to load content from many 
directories that contain zip files, which in turn contain .xml files.
Following the last communication with Geert, we are also using the transform 
option in order to ingest only XML content.  The suggested filtering approach 
using a transform works.

Unfortunately, when mlcp encounters a corrupt zip file (which we may receive 
from our sources), the process terminates.  Is there an option to instruct mlcp 
to keep going, that is, to skip the corrupt zip file and continue processing 
the rest of the zip files in the large, deeply nested directories?  It looks 
like the -tolerate_errors option won't work, given that we need to use a 
transform to ingest only XML files, and that forces the batch size to 1.

Please advise.

We are using the following options:
-input_file_path $inputFilePath \
-mode local -input_compressed true \
-output_uri_replace "(\/.+\/+)(?=.+\.zip),'/ourOverrideOfTheURIToRemoveTheLeadingNASPath/'" \
-output_collections "$collections" \
-database $dbName -output_permissions ... \
-transform_module /ourNamespace/ourTransformModule.xqy \
-transform_namespace "http://cas.org/..." \
-xml_repair_level full

Thank you,

Kristina Morales-Martin
Sr. Technical Information Specialist, e-Content Operations
CAS, a division of the American Chemical Society
2540 Olentangy River Road
Columbus, OH 43202
Phone: 614-447-3600, ext. 2322
Fax: 614-447-3827
www.cas.org


