OK, I tried that, but how do I convert an RDD to a Set that I can then
broadcast and cache?

      val badIPs = sc.textFile("hdfs:///user/jon/"+ "badfullIPs.csv")
      val badIPsLines = badIPs.getLines
      val badIpSet = badIPsLines.toSet
      val badIPsBC = sc.broadcast(badIpSet)

produces the error "value getLines is not a member of
org.apache.spark.rdd.RDD[String]".

Leaving it as an RDD and constantly joining against it will, I think, be too
slow for a streaming job.
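
What I'm guessing I need is to collect the (small) RDD back to the driver and
build the Set there before broadcasting, along the lines of the sketch below,
though I'm not sure it's the idiomatic way:

      // Guess at a fix: collect() pulls the RDD to the driver as an Array[String],
      // which can be turned into a Set and broadcast (assumes the IP list is small)
      val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv")
      val badIpSet = badIPs.collect().toSet
      val badIPsBC = sc.broadcast(badIpSet)

Is collect() the right call there, or is there a cheaper way to materialize the Set?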

On Thu, Feb 5, 2015 at 8:06 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hi Jon,
>
> You'll need to put the file on HDFS (or whatever distributed filesystem
> you're running on) and load it from there.
>
> -Sandy
>
> On Thu, Feb 5, 2015 at 3:18 PM, YaoPau <jonrgr...@gmail.com> wrote:
>
>> I have a file "badFullIPs.csv" of bad IP addresses used for filtering.  In
>> yarn-client mode, I simply read it off the edge node, transform it, and
>> then
>> broadcast it:
>>
>>       import scala.io.Source.fromFile
>>       val badIPs = fromFile(edgeDir + "badfullIPs.csv")
>>       val badIPsLines = badIPs.getLines
>>       val badIpSet = badIPsLines.toSet
>>       val badIPsBC = sc.broadcast(badIpSet)
>>       badIPs.close
>>
>> How can I accomplish this in yarn-cluster mode?
>>
>> Jon
>>
>>
>>
>
