[ https://issues.apache.org/jira/browse/NIFI-12791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Ethier updated NIFI-12791:
-------------------------------
    Summary: ParseDocument PDF - Missing pillow-heif dependency  (was: ParseDocument PDF - Missing Pillow dependency)

> ParseDocument PDF - Missing pillow-heif dependency
> --------------------------------------------------
>
>                 Key: NIFI-12791
>                 URL: https://issues.apache.org/jira/browse/NIFI-12791
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 2.0.0-M2
>            Reporter: Alex Ethier
>            Assignee: Alex Ethier
>            Priority: Major
>
> The custom Python processor ParseDocument, when configured to parse PDFs,
> throws the exception below because of a missing import.
> The error message `ModuleNotFoundError: No module named 'pillow_heif'`
> indicates that the latest version of the unstructured dependency now
> requires 'pillow_heif' to be installed.
> Full stack trace:
> {code:java}
> py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
>   File "/opt/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py", line 2466, in _call_proxy
>     return_value = getattr(self.pool[obj_id], method)(*params)
>   File "/opt/nifi-2.0.0-SNAPSHOT/python/api/nifiapi/flowfiletransform.py", line 33, in transformFlowFile
>     return self.transform(self.process_context, flowfile)
>   File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 257, in transform
>     documents = self.create_docs(context, flowFile)
>   File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 225, in create_docs
>     documents = loader.load()
>   File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/unstructured.py", line 87, in load
>     elements = self._get_elements()
>   File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/pdf.py", line 57, in _get_elements
>     from unstructured.partition.pdf import partition_pdf
>   File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/unstructured/partition/pdf.py", line 38, in <module>
>     from pillow_heif import register_heif_opener
> ModuleNotFoundError: No module named 'pillow_heif'
>     at py4j.Protocol.getReturnValue(Protocol.java:476)
>     at org.apache.nifi.py4j.client.PythonProxyInvocationHandler.invoke(PythonProxyInvocationHandler.java:64)
>     at org.apache.nifi.py4j.client.NiFiPythonGateway$1.invoke(NiFiPythonGateway.java:148)
>     at jdk.proxy29/jdk.proxy29.$Proxy179.transformFlowFile(Unknown Source)
>     at org.apache.nifi.python.processor.FlowFileTransformProxy.onTrigger(FlowFileTransformProxy.java:66)
>     at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
>     at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1274)
>     at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:244)
>     at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:102)
>     at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
>     at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
>     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
> {code}
> Including 'pillow-heif' in the list of required dependencies for
> ParseDocument fixes the issue (PR forthcoming).
> Another possible fix is pinning the dependency version numbers so that
> upstream updates cannot introduce breaking changes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
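
The missing import reported in the trace can be checked for ahead of time. The following is a minimal sketch, not part of NiFi or of the forthcoming PR: the helper name `missing_modules` is hypothetical, and it assumes the script is run with the same Python environment that NiFi uses for ParseDocument.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # 'pillow_heif' is the import name taken from the ModuleNotFoundError;
    # it is provided by the 'pillow-heif' package.
    print(missing_modules(["pillow_heif"]))
```

An empty list means the import named in the error should resolve; a non-empty list names the modules that still need to be added to the processor's dependencies.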