Re: Unit test with sqlContext

2016-02-05 Thread Steve Annessa
Thanks for all of the responses.

I do have an afterAll that stops the sc.

While looking over Holden's readme I noticed she mentioned "Make sure to
disable parallel execution." That was what I was missing; I added the
following to my build.sbt:

```
parallelExecution in Test := false
```

Now all of my tests are running.
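
For anyone who finds this thread later, a minimal sketch of the pattern
discussed here (one SparkContext/SQLContext per suite, stopped in afterAll,
with suites run serially) might look like the following. This assumes
ScalaTest's BeforeAndAfterAll; the trait name and the master/appName values
are only illustrative:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, Suite}

// One context per suite; afterAll stops it so the next suite can start its own.
// Combined with `parallelExecution in Test := false`, only one SparkContext
// exists in the test JVM at any time.
trait SparkTestSupport extends BeforeAndAfterAll { self: Suite =>
  @transient var sc: SparkContext = _
  @transient var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("unit-tests")
    sc = new SparkContext(conf)
    sqlContext = new SQLContext(sc)
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
    sc = null
    sqlContext = null
    super.afterAll()
  }
}
```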

I'm going to look into using the package she created.

Thanks again,

-- Steve


On Thu, Feb 4, 2016 at 8:50 PM, Rishi Mishra <rmis...@snappydata.io> wrote:

> Hi Steve,
> Have you cleaned up your SparkContext (sc.stop()) in an afterAll()? The
> error suggests you are creating more than one SparkContext.
>
>
> On Fri, Feb 5, 2016 at 10:04 AM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> Thanks for recommending spark-testing-base :) Just wanted to add if
>> anyone has feature requests for Spark testing please get in touch (or add
>> an issue on the github) :)
>>
>>
>> On Thu, Feb 4, 2016 at 8:25 PM, Silvio Fiorito <
>> silvio.fior...@granturing.com> wrote:
>>
>>> Hi Steve,
>>>
>>> Have you looked at the spark-testing-base package by Holden? It’s really
>>> useful for unit testing Spark apps as it handles all the bootstrapping for
>>> you.
>>>
>>> https://github.com/holdenk/spark-testing-base
>>>
>>> DataFrame examples are here:
>>> https://github.com/holdenk/spark-testing-base/blob/master/src/test/1.3/scala/com/holdenkarau/spark/testing/SampleDataFrameTest.scala
>>>
>>> Thanks,
>>> Silvio
>>>
>>> From: Steve Annessa <steve.anne...@gmail.com>
>>> Date: Thursday, February 4, 2016 at 8:36 PM
>>> To: "user@spark.apache.org" <user@spark.apache.org>
>>> Subject: Unit test with sqlContext
>>>
>>> I'm trying to unit test a function that reads in a JSON file,
>>> manipulates the DF and then returns a Scala Map.
>>>
>>> The function has signature:
>>> def ingest(dataLocation: String, sc: SparkContext, sqlContext:
>>> SQLContext)
>>>
>>> I've created a bootstrap spec for spark jobs that instantiates the Spark
>>> Context and SQLContext like so:
>>>
>>> @transient var sc: SparkContext = _
>>> @transient var sqlContext: SQLContext = _
>>>
>>> override def beforeAll = {
>>>   System.clearProperty("spark.driver.port")
>>>   System.clearProperty("spark.hostPort")
>>>
>>>   val conf = new SparkConf()
>>>     .setMaster(master)
>>>     .setAppName(appName)
>>>
>>>   sc = new SparkContext(conf)
>>>   sqlContext = new SQLContext(sc)
>>> }
>>>
>>> When I do not include sqlContext, my tests run. Once I add the
>>> sqlContext I get the following errors:
>>>
>>> 16/02/04 17:31:58 WARN SparkContext: Another SparkContext is being
>>> constructed (or threw an exception in its constructor).  This may indicate
>>> an error, since only one SparkContext may be running in this JVM (see
>>> SPARK-2243). The other SparkContext was created at:
>>> org.apache.spark.SparkContext.<init>(SparkContext.scala:81)
>>>
>>> 16/02/04 17:31:59 ERROR SparkContext: Error initializing SparkContext.
>>> akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is
>>> not unique!
>>>
>>> and finally:
>>>
>>> [info] IngestSpec:
>>> [info] Exception encountered when attempting to run a suite with class
>>> name: com.company.package.IngestSpec *** ABORTED ***
>>> [info]   akka.actor.InvalidActorNameException: actor name
>>> [ExecutorEndpoint] is not unique!
>>>
>>>
>>> What do I need to do to get a sqlContext through my tests?
>>>
>>> Thanks,
>>>
>>> -- Steve
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>


Unit test with sqlContext

2016-02-04 Thread Steve Annessa
I'm trying to unit test a function that reads in a JSON file, manipulates
the DF and then returns a Scala Map.

The function has signature:
def ingest(dataLocation: String, sc: SparkContext, sqlContext: SQLContext)

I've created a bootstrap spec for spark jobs that instantiates the Spark
Context and SQLContext like so:

@transient var sc: SparkContext = _
@transient var sqlContext: SQLContext = _

override def beforeAll = {
  System.clearProperty("spark.driver.port")
  System.clearProperty("spark.hostPort")

  val conf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)

  sc = new SparkContext(conf)
  sqlContext = new SQLContext(sc)
}
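
(The replies above suggest stopping this context in a matching afterAll and
running the test suites serially. A minimal sketch of the teardown, assuming
ScalaTest's BeforeAndAfterAll:)

override def afterAll = {
  if (sc != null) sc.stop()
  sc = null
  sqlContext = null
}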

When I do not include sqlContext, my tests run. Once I add the sqlContext I
get the following errors:

16/02/04 17:31:58 WARN SparkContext: Another SparkContext is being
constructed (or threw an exception in its constructor).  This may indicate
an error, since only one SparkContext may be running in this JVM (see
SPARK-2243). The other SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:81)

16/02/04 17:31:59 ERROR SparkContext: Error initializing SparkContext.
akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is not
unique!

and finally:

[info] IngestSpec:
[info] Exception encountered when attempting to run a suite with class
name: com.company.package.IngestSpec *** ABORTED ***
[info]   akka.actor.InvalidActorNameException: actor name
[ExecutorEndpoint] is not unique!


What do I need to do to get a sqlContext through my tests?

Thanks,

-- Steve


DataFrame repartition not repartitioning

2015-09-16 Thread Steve Annessa
Hello,

I'm trying to load in an Avro file and write it out as Parquet. I would
like to have enough partitions to properly parallelize on. When I do the
simple load and save I get 1 partition out. I thought I would be able to
use repartition like the following:

val avroFile =
  sqlContext.read.format("com.databricks.spark.avro").load(inFile)
avroFile.repartition(10)
avroFile.save(outFile, "parquet")

However, the saved file is still a single partition in the directory.

What am I missing?
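
As a side note for anyone hitting the same thing: DataFrame.repartition
returns a new DataFrame rather than changing the one it is called on, so the
repartitioned result has to be the value that gets saved. A minimal sketch,
using the same spark-avro reader and the same save call as above:

val avroFile =
  sqlContext.read.format("com.databricks.spark.avro").load(inFile)
val repartitioned = avroFile.repartition(10)
repartitioned.save(outFile, "parquet")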

Thanks,

-- Steve