Hi All,

We recently hit the known issue where PySpark does not work when the assembly 
jar contains more than 65K files. Our build and runtime environments are both 
Java 7, but Python fails to unzip the assembly jar as expected 
(https://issues.apache.org/jira/browse/SPARK-1911).

All nodes in our YARN cluster have Spark deployed (at the same local location), 
so we are contemplating the following workaround (apart from using a 
Java 6-compiled assembly):

Modify PYTHONPATH to give preference to "$SPARK_HOME/python" and 
"$SPARK_HOME/python/lib/py4j-0.8.1-src.zip"; with this, the assembly jar does 
not need to be unzipped to access the Python files. This worked fine in my 
limited testing, and I think it should work as long as the only reason to 
unzip the assembly jar is to extract the Python files and nothing else (is 
there any reason to believe that this may not be the case?).
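For reference, a minimal sketch of the PYTHONPATH change we have in mind (the
/opt/spark install path here is just an example; SPARK_HOME is assumed to point
at the per-node Spark deployment):

```shell
# Example only: SPARK_HOME stands in for wherever Spark is deployed
# at the same local location on every node.
export SPARK_HOME=/opt/spark

# Put the plain Python sources and the bundled py4j zip ahead of
# everything else on PYTHONPATH, so Python picks them up directly
# instead of trying to unzip the oversized assembly jar.
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.1-src.zip${PYTHONPATH:+:$PYTHONPATH}"
```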

I would appreciate your opinion on this workaround.

Thanks,
Rahul Singhal
