Re: Confusing RDD function

2016-03-08 Thread Hemminger Jeff
> a transformation and thus is not actually applied until some action
> (like 'foreach') is called on the resulting RDD.
> You can find more information in the Spark Programming Guide:
> http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
>
> best,
> --Ja…
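The point in the reply above can be sketched in a few lines. This is a minimal illustration (app name and values are made up, and it assumes a local Spark install): `map` only records a transformation in the lineage, and nothing executes until an action such as `foreach` runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

val nums = sc.parallelize(1 to 5)

// Transformation: nothing executes here, Spark only records the lineage.
val doubled = nums.map { n =>
  println(s"processing $n")  // does not print until an action is called
  n * 2
}

// Action: this triggers evaluation of the whole lineage above.
doubled.foreach(n => println(n))
```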

Confusing RDD function

2016-03-08 Thread Hemminger Jeff
I'm currently developing a Spark Streaming application. I have a function that receives an RDD and an object instance as parameters, and returns an RDD: def doTheThing(a: RDD[A], b: B): RDD[C]. Within the function, I do some processing within a map of the RDD, like this: def doTheThing(a: …
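The preview is cut off, but the pattern described can be sketched as follows. The case classes and the `lookup` helper are hypothetical stand-ins, not the poster's actual code; the key point is that the returned RDD is lazy, and that `b` is captured in the closure and must therefore be serializable.

```scala
import org.apache.spark.rdd.RDD

case class A(id: Long)
case class C(id: Long, tag: String)

// Hypothetical helper object passed into the function.
class B extends Serializable {
  def lookup(id: Long): String = s"tag-$id"
}

// Returns a new RDD; the map below is a lazy transformation and
// only runs once an action is applied to the result.
def doTheThing(a: RDD[A], b: B): RDD[C] =
  a.map(x => C(x.id, b.lookup(x.id)))
```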

String operation in filter with a special character

2015-10-05 Thread Hemminger Jeff
I have a rather odd use case. I have a DataFrame column name with a + value in it. The app performs some processing steps before determining the column name, and it would be much easier to code if I could use the DataFrame filter operations with a String. This demonstrates the issue I am having:

Re: spark-ec2 config files.

2015-10-05 Thread Hemminger Jeff
The spark-ec2 script generates spark config files from templates. Those are located here: https://github.com/amplab/spark-ec2/tree/branch-1.5/templates/root/spark/conf Note the link is referring to the 1.5 branch. Is this what you are looking for? Jeff On Mon, Oct 5, 2015 at 8:56 AM, Renato

Re: String operation in filter with a special character

2015-10-05 Thread Hemminger Jeff
escape weird characters in column names.

> On Mon, Oct 5, 2015 at 12:59 AM, Hemminger Jeff <j...@atware.co.jp> wrote:
>
>> I have a rather odd use case. I have a DataFrame column name with a +
>> value in it.
>> The app performs some processing steps before determ…
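The escaping the reply refers to can be done with backticks around the column name. A minimal sketch (the column name `a+b` and the sample data are assumed for illustration, and this uses the `SparkSession` API rather than the `SQLContext` API of the thread-era Spark 1.5):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("x", 1), ("y", 2)).toDF("a+b", "n")

// A direct column reference treats the name literally, + and all.
df.filter(df("a+b") === "x").show()

// In a SQL expression string, backticks escape the special character.
df.filter("`a+b` = 'x'").show()
```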

What happens when cache is full?

2015-09-12 Thread Hemminger Jeff
I am trying to understand the process of caching and specifically what the behavior is when the cache is full. Please excuse me if this question is a little vague, I am trying to build my understanding of this process. I have an RDD that I perform several computations with, I persist it with
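The behavior the question asks about is governed by the storage level: when the cache fills, Spark evicts partitions (LRU), and what happens next depends on how the RDD was persisted. A sketch, assuming an existing `SparkContext` named `sc` and a hypothetical input path:

```scala
import org.apache.spark.storage.StorageLevel

// Under memory pressure:
//   MEMORY_ONLY     -> evicted partitions are dropped and recomputed
//                      from lineage the next time they are needed
//   MEMORY_AND_DISK -> partitions that don't fit in memory spill to
//                      local disk instead of being recomputed
val data = sc.textFile("hdfs:///path/to/input")  // hypothetical path
  .map(_.split(","))

data.persist(StorageLevel.MEMORY_AND_DISK)

data.count()  // first action materializes and caches the partitions
data.count()  // served from cache (memory, spilled disk, or recomputed)
```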

Re: Alternative to Large Broadcast Variables

2015-08-29 Thread Hemminger Jeff
…need to create the connection within a mapPartitions code block to avoid the connection setup/teardown overhead? I haven't done this myself though, so I'm just throwing the idea out there.

On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff <j...@atware.co.jp> wrote:

> Hi, I am working…
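The mapPartitions idea suggested above can be sketched like this. The `LookupClient` is a hypothetical stand-in for whatever external store replaces the broadcast table; the point is one connection per partition rather than one per record. Note that with a lazy iterator the connection must outlive consumption, so this sketch materializes each partition before closing.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical external lookup-store client.
class LookupClient {
  def get(key: Long): String = s"value-$key"
  def close(): Unit = ()
}

def enrich(rdd: RDD[Long]): RDD[(Long, String)] =
  rdd.mapPartitions { iter =>
    val client = new LookupClient()   // one connection per partition
    // Materialize before closing; iter.map alone is lazy and would
    // otherwise run after client.close().
    val result = iter.map(k => (k, client.get(k))).toList
    client.close()
    result.iterator
  }
```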

Alternative to Large Broadcast Variables

2015-08-28 Thread Hemminger Jeff
Hi, I am working on a Spark application that is using a large (~3G) broadcast variable as a lookup table. The application refines the data in this lookup table in an iterative manner, so this large variable is broadcast many times during the lifetime of the application process. From what I…
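The iterative rebroadcast pattern described can be sketched as below. Everything here is an assumption for illustration (the `data` RDD and the `initialTable`, `refine`, and `rebuild` helpers are placeholders, and an existing `SparkContext` `sc` is assumed); the one concrete API point is unpersisting the old broadcast before shipping the next one, so executors free the previous ~3G copy.

```scala
// Placeholders standing in for the application's actual logic.
var table: Map[Long, String] = initialTable()

for (i <- 1 to numIterations) {
  // Ship the current lookup table to every executor.
  val bcast = sc.broadcast(table)

  val refined = data.map { rec =>
    val looked = bcast.value.getOrElse(rec.key, "")
    refine(rec, looked)               // hypothetical refinement step
  }.collect()

  table = rebuild(table, refined)     // hypothetical table update
  bcast.unpersist(blocking = true)    // free executor copies before the next broadcast
}
```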