Hi Aishwarya,
Yes, it is quite strange that Jupyter isn't running on the PySpark kernel even
though it's being started in that manner. The good news is that we use this
setup every day, so once we find the root cause of your Jupyter issue, it
should work great! Let's temporarily remove all of the existing Jupyter/IPython
settings & kernels and start fresh. Assuming you are on OS X / macOS or Linux,
can you do the following? (Please double-check the exact paths, as I'm typing
on a phone.)
* Stop Jupyter, and make sure that it is not running.
* Temporarily remove the Jupyter kernels. First, see where they are
installed, and then rename that path:
`jupyter kernelspec list`
Look at the paths listed above. For example, on macOS they may be located at
~/Library/Jupyter/kernels, in which case you would move them aside with the
following command. Update the path as needed to match the exact location
listed above.
`mv ~/Library/Jupyter/kernels ~/Library/Jupyter/kernels_OLD`
* Temporarily remove the Jupyter & IPython settings:
`mv ~/.jupyter ~/.jupyter_OLD`
`mv ~/.ipython ~/.ipython_OLD`
* Make sure Jupyter is up to date:
`pip3 install -U ipython jupyter`
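Taken together, the move-aside steps above can be sketched as a short shell
sequence. The default kernel path here is an assumption based on a typical
macOS install; substitute whatever `jupyter kernelspec list` prints on your
machine.

```shell
# Sketch of the reset above. KERNELS defaults to a typical macOS kernel
# path (an assumption); override it with the path printed by
# `jupyter kernelspec list`.
KERNELS="${KERNELS:-$HOME/Library/Jupyter/kernels}"

# Move everything aside rather than deleting it, so it can be restored later.
if [ -d "$KERNELS" ]; then mv "$KERNELS" "${KERNELS}_OLD"; fi
if [ -d "$HOME/.jupyter" ]; then mv "$HOME/.jupyter" "$HOME/.jupyter_OLD"; fi
if [ -d "$HOME/.ipython" ]; then mv "$HOME/.ipython" "$HOME/.ipython_OLD"; fi
```

After this, run the `pip3 install -U ipython jupyter` upgrade as above. To
undo the reset, simply move the `_OLD` paths back to their original names.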
After that, please ensure that Jupyter is not running, then start it in the
context of PySpark as described previously. Once Jupyter starts this time,
there should be only one kernel listed, and `sc` should be available.
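If `sc` still isn't defined after the fresh start, two quick terminal checks
(both are stock Jupyter subcommands) can confirm the environment is clean
before launching via pyspark:

```shell
# After the reset, this should show only the default kernelspec
# (typically "python3").
jupyter kernelspec list

# This should list no running servers before you start Jupyter via pyspark.
jupyter notebook list
```

If either command shows extra kernels or stray servers, that would point to
leftover state from a different Jupyter install on the machine.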
Can you try that?
--
Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry
Sent from my iPhone.
> On Apr 26, 2017, at 2:13 AM, Aishwarya Chaurasia
> wrote:
>
> Hi sir,
> The sc NameError persists.
>
> (1) There is only one jupyter server running. And that was started with the
> pyspark command in the previous mail.
> (2) Two kernels are appearing in the change kernel option - Python3 and
> Python2. Tried with both of them and the result is the same.
>
> How is jupyter not being able to run on the pyspark kernel when we have
> started the notebook with the pyspark command only?
>
> Is it possible to create a .py file of MachineLearning.ipynb like was done
> with preprocessing.ipynb with explicitly creating a SparkContext() ?
>
>> On 25-Apr-2017 11:57 PM, wrote:
>>
>> Hi Aishwarya,
>>
>> Unfortunately this mailing list removes all images, so I can't view your
>> screenshot. I'm assuming that it is the same issue with the missing
>> SparkContext `sc` object, but please let me know if it is a different
>> issue. This sounds like it could be an issue with multiple kernels
>> installed in Jupyter. When you start the notebook, can you see if there
>> are multiple kernels listed in the "Kernel" -> "Change Kernel" menu? If
>> so, please try one of the other kernels to see if Jupyter is starting by
>> default with a non-spark kernel. Also, is it possible that you have more
>> than one instance of the Jupyter server running? I.e. for this scenario,
>> we start Jupyter itself directly via pyspark using the command sent
>> previously, whereas usually Jupyter can just be started with `jupyter
>> notebook`. In the latter case, PySpark (and thus `sc`) would *not* be
>> available (unless you've set up special PySpark kernels separately). In
>> summary, can you (1) check for other kernels via the menus, and (2) check
>> for other running Jupyter servers that are non-PySpark?
>>
>> As for the other inquiry, great question! When training models, it's
>> quite useful to track the loss and other metrics (e.g., accuracy) from
>> *both* the training and validation sets. The reasoning is that it allows
>> for a more holistic view of the overall learning process, such as
>> evaluating whether any overfitting or underfitting is occurring. For
>> example, say that you train a model and achieve an accuracy of 80% on the
>> validation set. Is this good? Is this the best that can be done? Without
>> also tracking performance on the training set, it can be difficult to make
>> these decisions. Say that you then measure the performance on the training
>> set and find that the model achieves 100% accuracy on that data. That
>> might be a good indication that your model is overfitting the training set,
>> and that a combination of more data, regularization, and a smaller model
>> may be helpful in raising the generalization performance, i.e. the
>> performance on the validation set and future real examples on which you
>> wish to make predictions. If, on the other hand, the model achieves an 82%
>> accuracy on the training set, this could be a good indication that the model is
>> underfitting, and that a combination of a more expressive model and better
>> data could be helpful. In summary, tracking performance on both the
>> training and validation datasets can be useful for determining ways in
>> which to improve the overall learning process.
>>
>>
>> - Mike
>>
>> --
>>
>> Mike Dusenberry
>> GitHub: github.com/dusenberrymw
>> LinkedIn: