Hi Aishwarya,

Thanks for sharing more info on the issue!

To make the code easier to use, I've pulled most of the preprocessing logic out 
into a `breastcancer/preprocessing.py` module, leaving just the execution in the 
`Preprocessing.ipynb` notebook.  There is also a `preprocess.py` script with the 
same contents as the notebook for use with `spark-submit`.  The choice between 
the notebook and the script is just a matter of convenience, as they both import 
from the same `breastcancer/preprocessing.py` module.

As part of the updates, I've added an explicit SparkSession parameter (`spark`) 
to the `preprocess(...)` function, and updated the body to use this 
SparkSession object rather than the older SparkContext `sc` object.  
Previously, the `preprocess(...)` function accessed the `sc` object that was 
pulled in from the enclosing scope, which would work while all of the code was 
colocated within the notebook, but not if the code was extracted and imported.  
The explicit parameter now allows for the code to be imported.
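
A minimal sketch of the scoping difference described above, not the actual 
project code.  `FakeSession`, `preprocess_implicit`, and `preprocess_explicit` 
are hypothetical names, and the stand-in class only exists so that the example 
runs without Spark installed.  The point is that a Python function resolves a 
free name such as `sc` in the globals of the module that defines it, not in the 
scope of whoever calls it, which is why a `NameError` can appear once the 
function lives in an imported module.

```python
class FakeSession:
    """Hypothetical stand-in for a SparkSession so the sketch runs without Spark."""

    def range(self, n):
        return list(range(n))


def preprocess_implicit(n):
    # `sc` is looked up in the globals of the module that defines this
    # function, not in the caller's scope.  Nothing named `sc` is defined
    # here, so calling this raises NameError.
    return sc.range(n)


def preprocess_explicit(spark, n):
    # The session is supplied by the caller, so no global name is needed.
    return spark.range(n)


def implicit_fails():
    """Return True if the implicit version raises NameError."""
    try:
        preprocess_implicit(3)
    except NameError:
        return True
    return False


result = preprocess_explicit(FakeSession(), 3)
```

With the explicit parameter, the same function works whether it is called from 
a notebook cell or from a script, since the caller decides which session to 
pass in.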

Can you please try again with the latest updates?  We are currently using Spark 
2.x with Python 3.  If you use the notebook, the pyspark kernel should have a 
`spark` object available that can be supplied to the functions (as is done now 
in the notebook), and if you use the `preprocess.py` script with 
`spark-submit`, the `spark` object will be created explicitly by the script.
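
As a rough illustration of what creating the `spark` object explicitly can look 
like in a script run with `spark-submit`, here is a sketch that assumes PySpark 
2.x or later is available.  The helper names `get_spark` and `main`, the 
application name, and the placeholder work inside `main` are assumptions made 
for the example; they are not taken from `preprocess.py`.

```python
def get_spark(app_name="preprocess-sketch"):
    """Return the active SparkSession, creating one if none exists."""
    # Imported inside the function so that this module can still be imported
    # on a machine where Spark is not on the Python path.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def main():
    spark = get_spark()
    try:
        # Placeholder for the real work, e.g. passing `spark` on to the
        # preprocessing function.  Here we only run a trivial job.
        print(spark.range(5).count())
    finally:
        # Release the session's resources when the script is done.
        spark.stop()
```

A script built this way would call `main()` at the bottom of the file.  In a 
notebook with the pyspark kernel this step is unnecessary, since `spark` already 
exists and can be passed along directly.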

For a bit of context for others, Aishwarya initially reached out to find out if 
our breast cancer project could be applied to TIFF images, rather than the SVS 
images we are currently using (the answer is "yes", so long as they are "generic 
tiled TIFF" images, according to the OpenSlide documentation), and then followed 
up with Spark issues related to the preprocessing code.  This conversation has 
been moved to the mailing list so that others in the community can benefit.


Thanks!

-Mike

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <aishwarya2...@gmail.com> 
> wrote:
> 
> Hey,
> 
> The object sc is already defined in pyspark and yet this name error keeps
> occurring. We are using spark 2.*
> 
> Here is the link to error that we are getting :
> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
