Hi,
I'm just getting started with Spark and I wrote a simple job in Scala.
Here is a sketch:
val TAB = "\t"
val support = 2

val sc = new SparkContext(...)
val raw = sc.textFile(...)
val filtered = raw.map(line => {
  val lineSplit = line.split(TAB)   // TAB is null here during the run and an exception is thrown
  ...
}).filter(p => p._2 >= support)     // support here is 0 during the run
...
When I run the sbt-assembly jar with "java -cp ..." on a standalone cluster, I
find that the two values, TAB and support, are set to their default values when
they are referenced inside the RDD transformation: TAB is null and support is 0,
no longer "\t" and 2 as they are initialized above.
If the same jar is run locally (MASTER is local or local[k] instead of
spark://...) on the same input, it runs perfectly. The code also runs fine
in the spark-shell on the cluster.
To get the jar to run correctly on the cluster, I have to hard-code the string
literal and the number directly in the RDD transformation.
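For reference, a minimal self-contained version of that hard-coded workaround
looks like the following. The master URL, input path, job name and the
(line, field count) pairing in the map are placeholders I made up so the
example stands on its own; the only point is that the literals "\t" and 2
appear directly in the transformation instead of the TAB and support vals:

import spark.SparkContext

object TabCountJob {
  def main(args: Array[String]) {
    // Placeholder master URL and input path, just to make the example complete.
    val sc = new SparkContext("spark://master:7077", "TabCountJob")
    val raw = sc.textFile("hdfs://namenode:9000/path/to/input")

    // The literals are written directly inside the closure instead of
    // referring to the TAB and support vals defined outside of it.
    val filtered = raw
      .map(line => (line, line.split("\t").length)) // "\t" instead of TAB
      .filter(p => p._2 >= 2)                       // 2 instead of support

    println("lines kept: " + filtered.count())
  }
}

Written this way, the job runs on the standalone cluster as well.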
This really looks like a weird bug to me; maybe it has something to do with how
the sbt-assembly jar is built? Any suggestions?
Thanks.
I'm using Spark 0.7.3 and Scala 2.9.3.
--
JU Han
Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
GI06 - Fouille de Données et Décisionnel
+33 0619608888