ScalaTest is unable to initialize Spark context

2017-04-06 Thread PSwain
Hi All,

   I am trying to use ScalaTest to test a small piece of Spark code, but the
Spark context is not getting initialized when I run the test file.
I have included the code, the POM, and the exception below. Please help me
understand what mistake I am making that prevents the Spark context from
initializing.

Code:-

import org.apache.log4j.LogManager
import org.apache.spark.SharedSparkContext
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

/**
 * Created by PSwain on 4/5/2017.
 */
class Test extends FunSuite with SharedSparkContext {

  test("test initializing spark context") {
    val list = List(1, 2, 3, 4)
    val rdd = sc.parallelize(list)
    assert(list.length === rdd.count())
  }
}

POM File:-



<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>tesing.loging</groupId>
    <artifactId>logging</artifactId>
    <version>1.0-SNAPSHOT</version>

    <repositories>
        <repository>
            <id>central</id>
            <name>central</name>
            <url>http://repo1.maven.org/maven/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.6.0</version>
            <type>test-jar</type>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.10</artifactId>
            <version>2.2.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.5.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-csv_2.10</artifactId>
            <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>com.rxcorp.bdf.logging</groupId>
            <artifactId>loggingframework</artifactId>
            <version>1.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.6</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.5</version>
            <scope>compile</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest</artifactId>
            <version>1.4.RC2</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>2.10.5</version>
            <scope>compile</scope>
            <optional>true</optional>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.2.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <sourceDir>src/main/scala</sourceDir>
                    <jvmArgs>
                        <jvmArg>-Xms64m</jvmArg>
                        <jvmArg>-Xmx1024m</jvmArg>
                    </jvmArgs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Exception:-



An exception or error caused a run to abort.

java.lang.ExceptionInInitializerError
 at org.apache.spark.Logging$class.initializeLogging(Logging.scala:121)
 at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:106)
 at org.apache.spark.Logging$class.log(Logging.scala:50)
 at org.apache.spark.SparkContext.log(SparkContext.scala:79)
 at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
 at org.apache.spark.SparkContext.logInfo(SparkContext.scala:79)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:211)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:147)
 at org.apache.spark.SharedSparkContext$class.beforeAll(SharedSparkContext.scala:33)
 at Test.beforeAll(Test.scala:10)
 at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
 at Test.beforeAll(Test.scala:10)
 at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
 at Test.run(Test.scala:10)
 at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
 at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
 at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
 at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
 at org.scalatest.tools.Runner$$anonfun$runOptionall
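
One way to take the spark-core test-jar's SharedSparkContext (and the two different ScalaTest versions in the POM above) out of the picture is to manage a local SparkContext directly in the suite with BeforeAndAfterAll. A minimal sketch, assuming the same spark-core 1.6.0 and scalatest_2.10 2.2.6 dependencies; the suite name is illustrative, and this is a simpler baseline to test against rather than a confirmed fix for the ExceptionInInitializerError:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class LocalContextTest extends FunSuite with BeforeAndAfterAll {

  // Managed explicitly instead of mixing in SharedSparkContext from the
  // spark-core test-jar.
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")        // run inside the test JVM, no cluster needed
      .setAppName("scalatest-spark")
    sc = new SparkContext(conf)
  }

  override def afterAll(): Unit = {
    if (sc != null) {
      sc.stop()                     // release the context so later suites can start their own
    }
  }

  test("test initializing spark context") {
    val list = List(1, 2, 3, 4)
    val rdd = sc.parallelize(list)
    assert(list.length === rdd.count())
  }
}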

RE: Huge partitioning job takes longer to close after all tasks finished

2017-03-09 Thread PSwain
Hi Swapnil,

  We are facing the same issue; could you please let me know how you found that
the partitions are getting merged?

Thanks in advance!

From: Swapnil Shinde [mailto:swapnilushi...@gmail.com]
Sent: Thursday, March 09, 2017 1:31 AM
To: cht liu 
Cc: user@spark.apache.org
Subject: Re: Huge partitioning job takes longer to close after all tasks 
finished

Thank you, liu. Can you please explain what you mean by enabling the Spark
fault-tolerance mechanism?
I observed that after all tasks finish, Spark works on concatenating the same
partitions from all tasks on the file system, e.g.,
task1 - partition1, partition2, partition3
task2 - partition1, partition2, partition3

Then, after task1 and task2 finish, Spark concatenates partition1 from task1 and
task2 to create the final partition1. This takes longer when there is a large
number of files. I am not sure if there is a way to stop Spark from
concatenating partitions from each task.

Thanks
Swapnil
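
A commonly suggested way to reduce the number of per-task files that have to be reconciled for each output partition is to repartition on the partition column before the write, so each partition value is produced by fewer tasks. A hedged sketch against the code quoted below (datasetA, column1, and outputPath come from the original post; whether this helps depends on how skewed column1 is, since all rows for one value then flow through fewer tasks):

import org.apache.spark.sql.functions.col

// Shuffle rows so that each value of column1 lands in fewer tasks before the
// partitioned write; this trades extra shuffle cost for fewer small files per
// partition directory.
datasetA
  .repartition(col("column1"))
  .write
  .partitionBy("column1")
  .parquet(outputPath)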


On Tue, Mar 7, 2017 at 10:47 PM, cht liu wrote:

Did you enable the Spark fault-tolerance mechanism? With RDD checkpointing, a
separate job is started at the end of the run to write the checkpoint data to
the file system before it is persisted for high availability.

2017-03-08 2:45 GMT+08:00 Swapnil Shinde:
Hello all,
   I have a Spark job that reads Parquet data and partitions it based on one of
the columns. I made sure the partitions are equally distributed and not skewed.
My code looks like this -

datasetA.write.partitionBy("column1").parquet(outputPath)

Execution plan -
[Inline image 1]

All tasks (~12,000) finish in 30-35 minutes, but it then takes another 40-45
minutes to close the application. I am not sure what Spark is doing after all
tasks have completed successfully.
I checked thread dumps (using the UI executor tab) on a few executors but
couldn't find anything major. Overall, a few shuffle-client threads are
"RUNNABLE" and a few dispatched-* threads are "WAITING".

Please let me know what Spark is doing at this stage (after all tasks have
finished) and whether there is any way I can optimize it.

Thanks
Swapnil








Not able to remove header from a text file while creating a data frame

2017-03-04 Thread PSwain
Hi All,

I am reading a text file to create a DataFrame. I am trying to exclude the
header from the text file, but I am not able to do it.
My concern is how to know which options are available when reading from a
source. I checked the API, but the arguments of the option method are just
strings.

Could any one of you suggest how I can find all possible options?


Code :-
scala> val df1 = hc.read.format("text").option("header", "false").load("/development/data/swain/people.txt")
df1: org.apache.spark.sql.DataFrame = [value: string]

scala> df1.show()
++
|   value|
++
|name|age|address|...|
|   N1|27|A1|1201|
|   N2|28|A2|1202|
|   N3|29|A3|1203|
|   N4|30|A4|1204|



Thanks,
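
The built-in "text" source reads raw lines into a single value column and does not support a "header" option, which is why option("header", "false") has no effect here; each data source documents the options it accepts (for example, the spark-csv README for the CSV reader) rather than the generic option(...) API listing them. A hedged sketch that drops the header by reading the same file with spark-csv (already declared in the POM of the earlier thread), assuming the fields are pipe-separated as the show() output suggests:

// Read the file with the CSV source so the first line can be treated as the
// header and the remaining lines split into columns.
val df = hc.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // treat the first line as column names and drop it from the data
  .option("delimiter", "|")      // assumption: fields are pipe-separated
  .load("/development/data/swain/people.txt")

df.show()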
