Hi guys,
First let me say (again) that as a small startup we really love
Zeppelin+Spark: it gave us
a huge performance gain for data analytics and reporting.
We keep running into a problem when several users share the same Zeppelin
server,
with Spark on YARN:
The command lines are sent directly to a spark-shell, so all
notebooks share the same context. This can have dangerous side effects,
because a variable defined in one notebook can be
overridden by another notebook.
After giving it some thought, I found an
ugly-but-not-too-complicated-to-implement hack that
should solve that problem:
The idea would be to wrap each command sent to the spark-shell with this:
object ZepContext_<NOTEBOOK_ID>_<COMMAND_NUM> {
  import ZepContext_<NOTEBOOK_ID>_<ALL_PREVIOUS_COMMAND_NUM>._
  // insert command here
}
ZepContext_<NOTEBOOK_ID>_<COMMAND_NUM> ; ()
This way, we would wrap everything inside an object that would be accessible
only by this notebook, and thanks to the imports we would get all previously
defined variables too.
You can find a concrete Hello World example in the attached file.
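A minimal sketch of how the wrapper text could be generated is given below. The object ZepWrapper and its methods are hypothetical names used for illustration only; nothing like them exists in Zeppelin. The sketch assumes commands are numbered from 1 and that commands 1 to N-1 of the notebook have already been run, since importing a wrapper object that was never defined would not compile.

```scala
// Hypothetical helper: none of these names exist in Zeppelin.
object ZepWrapper {
  // Name of the wrapper object for one command of one notebook.
  def contextName(notebookId: String, commandNum: Int): String =
    s"ZepContext_${notebookId}_${commandNum}"

  // Wrap `command` so that it sees the definitions of commands
  // 1 to commandNum - 1 of the same notebook, and nothing from other notebooks.
  // Assumption: all of those earlier commands have already been run.
  def wrap(notebookId: String, commandNum: Int, command: String): String = {
    val name = contextName(notebookId, commandNum)
    // One wildcard import per earlier command of this notebook.
    val imports = (1 until commandNum)
      .map(n => s"  import ${contextName(notebookId, n)}._")
    // The user's command, indented inside the wrapper object.
    val body = command.split("\n").toSeq.map(line => "  " + line)
    // The trailing `name ; ()` forces the object to initialize and returns Unit.
    val lines = Seq(s"object $name {") ++ imports ++ body ++ Seq("}", s"$name ; ()")
    lines.mkString("\n")
  }
}
```

For example, ZepWrapper.wrap("A", 3, "println(hello)\nprintln(world)") produces the same text as the ZepContext_A_3 block of the attached file, apart from indentation.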
This would also require special treatment for imports, as they would not
be propagated
from one ZepContext to the next.
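One possible treatment, sketched below under strong assumptions, is to collect the import lines of earlier commands and repeat them at the top of each new wrapper object. ZepImports and its methods are hypothetical names, and the detection is a naive line-based check (any line starting with "import "), not a real Scala parser, so it would also pick up imports nested inside blocks.

```scala
// Hypothetical helper: a naive, line-based sketch, not a real Scala parser.
object ZepImports {
  // Split a command into its import lines and its remaining lines.
  def splitImports(command: String): (Seq[String], Seq[String]) =
    command.split("\n").toSeq.partition(line => line.trim.startsWith("import "))

  // Collect the distinct imports of all earlier commands, in order of first
  // appearance, so they can be repeated inside the next wrapper object.
  def carriedImports(previousCommands: Seq[String]): Seq[String] =
    previousCommands
      .flatMap(command => splitImports(command)._1.map(line => line.trim))
      .distinct
}
```

Applied to the first command of notebook C in the attached file, carriedImports would return the single line "import scala.util.Try", which could then be emitted at the top of ZepContext_C_2.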
Of course, there would be several dangers and open questions with this
approach, for example:
- The SparkContext would still be shared by all notebooks, but I guess that
  would be fine.
- We might need a different ZeppelinContext for each notebook?
- How would jar loading behave? Would we have any side effects?
- Will this cause (even) more PermGen space issues?
Finally, there is another (nice?) side effect that this solution would
have.
If I have a notebook like this:
Command 1:
    println(foo)
Command 2:
    val foo = "foo"
and I run command 2 and then command 1, it works with the current
behavior but not with the hack, because command 1 would only import the
contexts of the commands that come before it.
I guess that ideally it would be simpler to launch one spark-shell per
notebook, but I don't know if this is easily feasible, and it would
probably be a poor use of the cluster's resources.
Please tell me what you think. Do you have a better approach in
mind?
Cheers,
Furcy
// In notebook A
object ZepContext_A_1 {
  val hello = "Hello"
}
ZepContext_A_1 ; ()
object ZepContext_A_2 {
  import ZepContext_A_1._
  val world = "World"
}
ZepContext_A_2 ; ()
object ZepContext_A_3 {
  import ZepContext_A_1._
  import ZepContext_A_2._
  println(hello)
  println(world)
}
ZepContext_A_3 ; ()
// In notebook B
object ZepContext_B_1 {
  val world = "World"
}
ZepContext_B_1 ; ()
object ZepContext_B_2 {
  import ZepContext_B_1._
  println(hello)
  println(world)
}
ZepContext_B_2 ; ()
// ZepContext_B_2 fails to compile: the variable hello is not defined in notebook B
// In notebook C
object ZepContext_C_1 {
  import scala.util.Try
  val hello = "World"
  val world = "World"
  Try(println(hello))
}
ZepContext_C_1 ; ()
object ZepContext_C_2 {
  import ZepContext_C_1._
  Try(println(world))
}
ZepContext_C_2 ; ()
// ZepContext_C_2 fails to compile: `import ZepContext_C_1._` does not carry over
// the import of Try, so Try would have to be (re-)imported in ZepContext_C_2