OK I tried that, but how do I convert an RDD to a Set that I can then broadcast and cache?
val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv")
val badIPsLines = badIPs.getLines
val badIpSet = badIPsLines.toSet
val badIPsBC = sc.broadcast(badIpSet)

produces the error "value getLines is not a member of
org.apache.spark.rdd.RDD[String]". Leaving it as an RDD and then constantly
joining will, I think, be too slow for a streaming job.

On Thu, Feb 5, 2015 at 8:06 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hi Jon,
>
> You'll need to put the file on HDFS (or whatever distributed filesystem
> you're running on) and load it from there.
>
> -Sandy
>
> On Thu, Feb 5, 2015 at 3:18 PM, YaoPau <jonrgr...@gmail.com> wrote:
>
>> I have a file "badFullIPs.csv" of bad IP addresses used for filtering.
>> In yarn-client mode, I simply read it off the edge node, transform it,
>> and then broadcast it:
>>
>> val badIPs = fromFile(edgeDir + "badfullIPs.csv")
>> val badIPsLines = badIPs.getLines
>> val badIpSet = badIPsLines.toSet
>> val badIPsBC = sc.broadcast(badIpSet)
>> badIPs.close
>>
>> How can I accomplish this in yarn-cluster mode?
>>
>> Jon
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-broadcast-a-variable-read-from-a-file-in-yarn-cluster-mode-tp21524.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
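[Editorial note: the `getLines` error arises because `sc.textFile` returns an `RDD[String]`, which has no `getLines` method (that belongs to `scala.io.Source`). A likely fix, assuming the blacklist is small enough to fit in driver memory, is to `collect()` the RDD on the driver, build the Set there, and broadcast it. A minimal sketch, reusing the identifiers from the snippet above; not verified against a specific Spark version:]

```scala
// Read the blacklist from HDFS as an RDD[String], one IP per line.
val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv")

// RDD has no getLines; collect() pulls every line back to the driver
// (only reasonable because the blacklist is small), then build a Set
// for O(1) membership checks.
val badIpSet: Set[String] = badIPs.collect().toSet

// Broadcast the Set so each executor receives one read-only copy,
// avoiding a join against the RDD on every streaming batch.
val badIPsBC = sc.broadcast(badIpSet)

// Example use on executors (field name `ip` is hypothetical):
// stream.filter(rec => !badIPsBC.value.contains(rec.ip))
```

[This works in yarn-cluster mode too, since the file is read through HDFS rather than from the edge node's local filesystem.]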