A further update on the issue:
We have extracted the LVG-related files, preserving the exact folder structure,
and are copying all the folders recursively into the Spark executor working
directory using addFile with the recursive flag set. However, the LvgAnnotator
still cannot find the lvg.properties file on the classpath of the Spark
executor, even though we have set that classpath through the
spark.executor.extraClassPath configuration option.
Code snippet:
sc.addFile("hdfs:///ctakes_4.0.0/resources", true);
sparkConf.set("spark.executor.extraClassPath", "./resources/");
sparkConf.set("spark.driver.extraClassPath", "./resources/");
From: Eskala, Nagakalyana
Sent: Monday, April 30, 2018 8:50 PM
To: '[email protected]' <[email protected]>
Subject: cTakes on Apache Spark - Error
Background:
We are trying to run the Apache cTAKES default clinical pipeline in a Spark
Streaming application. We intend to parse all input text sent to a socket read
by Spark Streaming, running the default clinical pipeline inside the individual
executors of the Spark application.
Challenges:
The cTAKES pipeline requires external resources to be available on the
classpath. We have used JavaSparkContext.addFile to provide all the resources
(dictionaries) recursively from HDFS to each individual executor's working
directory. Once addFile has copied the resources to each executor, we try to
include them in the classpath of each executor using the following configuration:
sc.addFile("hdfs:///ctakes_4.0.0/resources", true);
sparkConf.set("spark.executor.extraClassPath", "./resources/");
sparkConf.set("spark.driver.extraClassPath", "./resources/");
Error:
The error occurs in the LvgAnnotator class, which tries to load the
lvg.properties file through a classpath lookup. The lookup returns null
(URL==null in the log below), and the subsequent attempt to copy the file fails
with a NullPointerException.
18/04/30 15:55:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, ANY, 4744 bytes)
18/04/30 15:55:50 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
18/04/30 15:55:51 INFO ae.LvgAnnotator: URL==null
18/04/30 15:55:51 INFO ae.LvgAnnotator: Unable to find org/apache/ctakes/lvg/data/config/lvg.properties.
18/04/30 15:55:51 INFO ae.LvgAnnotator: Copying files and directories to under /tmp/
18/04/30 15:55:51 INFO ae.LvgAnnotator: Copying lvg-related file to /tmp/data/config/lvg.properties
18/04/30 15:55:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
    at org.apache.commons.io.FileUtils.copyInputStreamToFile(FileUtils.java:1512)
    at org.apache.ctakes.lvg.ae.LvgAnnotator.copyLvgFiles(LvgAnnotator.java:620)
    at org.apache.ctakes.lvg.ae.LvgAnnotator.createAnnotatorDescription(LvgAnnotator.java:649)
    at org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory.getTokenProcessingPipeline(ClinicalPipelineFactory.java:110)
    at org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory.getDefaultPipeline(ClinicalPipelineFactory.java:68)
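One way to narrow this down might be to check whether the resource path reported
in the log actually resolves when a given directory is used as a classpath entry.
The following is a minimal sketch using only the Java standard library; it is not
cTAKES or Spark code. The class name ResourceLookupCheck and the helper
visibleFrom are hypothetical, and the sketch assumes the directory under test is
meant to be the root beneath which the org/apache/ctakes/... tree sits. It
illustrates that a directory entry only helps when the resource path is relative
to that directory itself.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical diagnostic helper; not part of cTAKES or Spark.
class ResourceLookupCheck {
    // Resource path copied from the "Unable to find" log message.
    static final String LVG_PROPERTIES =
            "org/apache/ctakes/lvg/data/config/lvg.properties";

    /**
     * Returns true if `resource` resolves when the existing directory `root`
     * is the only classpath entry (no application parent loader is consulted).
     */
    static boolean visibleFrom(Path root, String resource) throws IOException {
        // For an existing directory, toUri() ends with "/", which makes
        // URLClassLoader treat the entry as a directory rather than a JAR.
        URL[] urls = { root.toUri().toURL() };
        try (URLClassLoader loader = new URLClassLoader(urls, null)) {
            return loader.getResource(resource) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory that mimics the expected layout.
        Path root = Files.createTempDirectory("resources");
        Path props = root.resolve(LVG_PROPERTIES);
        Files.createDirectories(props.getParent());
        Files.createFile(props);

        // The root of the tree resolves the resource.
        System.out.println(visibleFrom(root, LVG_PROPERTIES));
        // A directory one level too deep does not.
        System.out.println(visibleFrom(root.resolve("org"), LVG_PROPERTIES));
    }
}
```

Calling visibleFrom with the absolute path of the copied resources directory on
an executor would indicate whether the directory layout matches the path that
LvgAnnotator looks up, independently of how the relative entry ./resources/ is
resolved by the JVM.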
Question:
Since the resources folder has been recursively added to each executor node and
the classpath has been set, we would expect each executor to be able to locate
the properties file and the other resource files. However, that is not the case.
Is there something we should be doing differently (configuration, classpath,
etc.) so that the cTAKES pipeline can run in a Spark executor with all the
resources and the classpath set appropriately?
Thanks for the help.