Hi,

I am trying to ingest a large number of files. The metadata for these files 
is stored in .met files.

Many of the metadata fields contain characters like '<>&$' etc.

Running the crawler on these metadata files fails.

When I escape the characters using HTML encoding (e.g. '>' becomes &gt;), 
I still get errors and the crawler cannot ingest the files.
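
For reference, a minimal sketch of the kind of escaping described above. It assumes Python's standard library and a hypothetical helper name (escape_met_value); it is not part of OODT and only illustrates which entities the characters map to.

```python
from xml.sax.saxutils import escape


def escape_met_value(raw: str) -> str:
    """Escape XML-special characters in a metadata value (hypothetical helper)."""
    # escape() always converts &, < and >; quotes are passed as extra entities.
    return escape(raw, {"'": "&#39;", '"': "&quot;"})


# The --libtype argument from the .met example:
print(escape_met_value("'T=PE:O=><:S=AS'"))
# -> &#39;T=PE:O=&gt;&lt;:S=AS&#39;
```

Note that each entity ends with a semicolon ('&gt;', not '&gt'), and that '$' needs no escaping in XML.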



Here is an example of the offending lines in the .met file, before and after 
HTML encoding:

<val>sailfish quant --index /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex 
--libtype 'T=PE:O=><:S=AS' -1 <(gunzip -c 
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz) -2 
<(gunzip -c 
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz) -o 
/gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
 -p 8  --no_bias_correct  </val>





<val>sailfish quant --index /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex 
--libtype &#39;T=PE:O=&gt;&lt;:S=AS&#39; -1 &lt;(gunzip -c 
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz) -2 
&lt;(gunzip -c 
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz) -o 
/gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
 -p 8  --no_bias_correct  </val>
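
As a sanity check on the two variants above, here is a small sketch, assuming Python's standard xml.etree.ElementTree parser as a stand-in for whatever XML parser the crawler uses. The fragments are shortened versions of the real <val> lines (R1.fastq.gz is an abbreviated placeholder path), and parse_val is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened stand-ins for the <val> lines shown above.
RAW = "<val>--libtype 'T=PE:O=><:S=AS' -1 <(gunzip -c R1.fastq.gz)</val>"
ESCAPED = ("<val>--libtype &#39;T=PE:O=&gt;&lt;:S=AS&#39; "
           "-1 &lt;(gunzip -c R1.fastq.gz)</val>")


def parse_val(fragment: str):
    """Return the text of a <val> element, or None if it is not well-formed XML."""
    try:
        return ET.fromstring(fragment).text
    except ET.ParseError:
        return None


print(parse_val(RAW))      # the raw '<' characters make this fragment unparseable
print(parse_val(ESCAPED))  # parses and yields the original command text
```

With this parser the escaped fragment is well-formed and decodes back to the original string, which suggests (but does not prove) that the remaining failure happens after the .met file is read, for example when the metadata is sent on to the file manager.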



If I remove the offending characters (in this case '<' and '>'), the ingestion 
goes through without any issues.



The crawler command is :

./crawler_launcher --operation --launchAutoCrawler --productPath $FILEPATH 
--filemgrUrl $OODT_FILEMGR_URL --clientTransferer 
org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory  
--mimeExtractorRepo ../policy/mime-extractor-map.xml --noRecur --crawlForDirs



The error message I get when I run the crawler is:
INFO: StdIngester: ingesting product: ProductName: [A1_1.Sailfish.sfish]: 
ProductType: [GenericFile]: FileLocation: 
[/datavault/RNA-Seq/Processed/Sailfish-transcriptCounts/]

org.apache.xmlrpc.XmlRpcException: java.lang.Exception: 
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error 
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP 
method failed: HTTP/1.1 400 Bad Request

      at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
      at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
      at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
      at org.apache.xmlrpc.XmlRpcClient.execute(XmlRpcClient.java:194)
      at org.apache.xmlrpc.XmlRpcClient.execute(XmlRpcClient.java:185)
      at org.apache.xmlrpc.XmlRpcClient.execute(XmlRpcClient.java:178)
      at org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient.ingestProduct(XmlRpcFileManagerClient.java:1178)
      at org.apache.oodt.cas.filemgr.ingest.StdIngester.ingest(StdIngester.java:199)
      at org.apache.oodt.cas.crawl.ProductCrawler.ingest(ProductCrawler.java:304)
      at org.apache.oodt.cas.crawl.ProductCrawler.handleFile(ProductCrawler.java:188)
      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:108)
      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)
      at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)
      at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)
      at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)
      at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)

Oct 07, 2014 11:17:18 PM 
org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient ingestProduct

SEVERE: Failed to ingest product 
[org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception: 
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error 
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP 
method failed: HTTP/1.1 400 Bad Request -- rolling back ingest

java.lang.Exception: Failed to ingest product 
[org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception: 
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error 
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP 
method failed: HTTP/1.1 400 Bad Request

      at org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient.ingestProduct(XmlRpcFileManagerClient.java:1279)
      at org.apache.oodt.cas.filemgr.ingest.StdIngester.ingest(StdIngester.java:199)
      at org.apache.oodt.cas.crawl.ProductCrawler.ingest(ProductCrawler.java:304)
      at org.apache.oodt.cas.crawl.ProductCrawler.handleFile(ProductCrawler.java:188)
      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:108)
      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)
      at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)
      at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)
      at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)
      at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)

Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.filemgr.ingest.StdIngester ingest

WARNING: exception ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to 
ingest product [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : 
java.lang.Exception: 
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error 
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP 
method failed: HTTP/1.1 400 Bad Request

Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.crawl.ProductCrawler ingest

WARNING: ProductCrawler: Exception ingesting product: 
[/datavault/RNA-Seq/Processed/Sailfish-transcriptCounts/A1_1.Sailfish.sfish]: 
Message: exception ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to 
ingest product [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : 
java.lang.Exception: 
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error 
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP 
method failed: HTTP/1.1 400 Bad Request: attempting to continue crawling

org.apache.oodt.cas.filemgr.structs.exceptions.IngestException: exception 
ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to ingest product 
[org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception: 
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error 
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP 
method failed: HTTP/1.1 400 Bad Request

      at org.apache.oodt.cas.filemgr.ingest.StdIngester.ingest(StdIngester.java:204)
      at org.apache.oodt.cas.crawl.ProductCrawler.ingest(ProductCrawler.java:304)
      at org.apache.oodt.cas.crawl.ProductCrawler.handleFile(ProductCrawler.java:188)
      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:108)
      at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)
      at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)
      at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)
      at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)
      at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)



Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.crawl.ProductCrawler handleFile

WARNING: Failed to ingest product: 
[/datavault/RNA-Seq/Processed/Sailfish-transcriptCounts/A1_1.Sailfish.sfish]: 
performing postIngestFail actions



Any ideas how I can ingest these files?

Thanks
K

