ScalaTest is unable to initialize SparkContext
Hi All,

I am trying to use ScalaTest to test a small piece of Spark code, but the SparkContext is not getting initialized when I run the test file. I have included the code, the POM, and the exception below. Please help me understand what mistake I am making that prevents the SparkContext from being initialized.

Code:-

import org.apache.log4j.LogManager
import org.apache.spark.SharedSparkContext
import org.scalatest.FunSuite
import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by PSwain on 4/5/2017.
 */
class Test extends FunSuite with SharedSparkContext {

  test("test initializing spark context") {
    val list = List(1, 2, 3, 4)
    val rdd = sc.parallelize(list)
    assert(list.length === rdd.count())
  }
}

POM File:-

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>tesing.loging</groupId>
  <artifactId>logging</artifactId>
  <version>1.0-SNAPSHOT</version>

  <repositories>
    <repository>
      <id>central</id>
      <name>central</name>
      <url>http://repo1.maven.org/maven/</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
      <type>test-jar</type>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_2.10</artifactId>
      <version>2.2.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>1.5.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>spark-csv_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>com.rxcorp.bdf.logging</groupId>
      <artifactId>loggingframework</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.6</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.10.5</version>
      <scope>compile</scope>
      <optional>true</optional>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest</artifactId>
      <version>1.4.RC2</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
      <version>2.10.5</version>
      <scope>compile</scope>
      <optional>true</optional>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.2.1</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <sourceDir>src/main/scala</sourceDir>
          <jvmArgs>
            <jvmArg>-Xms64m</jvmArg>
            <jvmArg>-Xmx1024m</jvmArg>
          </jvmArgs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Exception:-

An exception or error caused a run to abort.
java.lang.ExceptionInInitializerError
	at org.apache.spark.Logging$class.initializeLogging(Logging.scala:121)
	at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:106)
	at org.apache.spark.Logging$class.log(Logging.scala:50)
	at org.apache.spark.SparkContext.log(SparkContext.scala:79)
	at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
	at org.apache.spark.SparkContext.logInfo(SparkContext.scala:79)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:211)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:147)
	at org.apache.spark.SharedSparkContext$class.beforeAll(SharedSparkContext.scala:33)
	at Test.beforeAll(Test.scala:10)
	at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
	at Test.beforeAll(Test.scala:10)
	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
	at Test.run(Test.scala:10)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
	at org.scalatest.tools.Runner$$anonfun$runOptionall
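For comparison, here is a minimal self-contained sketch of a suite that manages its own SparkContext in beforeAll/afterAll instead of mixing in SharedSparkContext (which ships in the spark-core test-jar). This is not from the original thread; the class name and app name are illustrative, and the versions are assumed to match the POM above (Spark 1.6, ScalaTest 2.2, Scala 2.10):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Alternative to SharedSparkContext: the suite owns the SparkContext
// lifecycle itself, so the spark-core test-jar is not needed.
class LocalSparkContextTest extends FunSuite with BeforeAndAfterAll {

  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf()
      .setMaster("local[2]")            // run in-process with 2 threads
      .setAppName("scalatest-spark")    // illustrative app name
    sc = new SparkContext(conf)
  }

  override def afterAll(): Unit = {
    try {
      if (sc != null) sc.stop()         // release the context between suites
    } finally {
      super.afterAll()
    }
  }

  test("parallelize preserves element count") {
    val list = List(1, 2, 3, 4)
    assert(sc.parallelize(list).count() === list.length)
  }
}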
RE: Huge partitioning job takes longer to close after all tasks finished
Hi Swapnil,

We are facing the same issue. Could you please let me know how you found that the partitions are getting merged? Thanks in advance!

From: Swapnil Shinde [mailto:swapnilushi...@gmail.com]
Sent: Thursday, March 09, 2017 1:31 AM
To: cht liu
Cc: user@spark.apache.org
Subject: Re: Huge partitioning job takes longer to close after all tasks finished

Thank you, liu. Can you please explain what you mean by enabling the Spark fault tolerance mechanism?

I observed that after all tasks finish, Spark works on concatenating the same partitions from all tasks on the file system, e.g.:

    task1 - partition1, partition2, partition3
    task2 - partition1, partition2, partition3

After task1 and task2 finish, Spark concatenates partition1 from task1 and task2 to create the final partition1. This takes longer if we have a large number of files. I am not sure if there is a way to tell Spark not to concatenate partitions from each task.

Thanks
Swapnil

On Tue, Mar 7, 2017 at 10:47 PM, cht liu wrote:

Do you enable the Spark fault tolerance mechanism? When an RDD runs at the end of the job, a separate job will be started to write the checkpoint data to the file system before persisting it for high availability.

2017-03-08 2:45 GMT+08:00 Swapnil Shinde:

Hello all

I have a Spark job that reads parquet data and partitions it based on one of the columns. I made sure the partitions are equally distributed and not skewed. My code looks like this -

    datasetA.write.partitionBy("column1").parquet(outputPath)

Execution plan - [Inline image 1]

All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45 mins to close the application. I am not sure what Spark is doing after all tasks are processed successfully. I checked the thread dump (using the UI executor tab) on a few executors but couldn't find anything major. Overall, a few shuffle-client processes are "RUNNABLE" and a few dispatcher-* processes are "WAITING". Please let me know what Spark is doing at this stage (after all tasks finished) and any way I can optimize it.

Thanks
Swapnil
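For readers of the archive hitting the same slow commit phase: one commonly used mitigation (not mentioned in the thread, and not guaranteed to remove the per-partition merge work Swapnil describes) is the Hadoop FileOutputCommitter "algorithm version 2", which moves task output into place at task-commit time instead of renaming everything again in a single job-commit step after all tasks finish. A minimal sketch on the Spark 1.x API used in the thread, with illustrative input/output paths:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PartitionedWrite {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("partitioned-parquet-write")
      // Committer v2 commits each task's files directly to the final
      // location, shortening the post-task rename/concatenation phase.
      .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // Optionally skip writing the _SUCCESS marker file.
      .set("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Placeholder paths; same write pattern as in the thread.
    val datasetA = sqlContext.read.parquet("/path/to/input")
    datasetA.write
      .partitionBy("column1")
      .parquet("/path/to/output")

    sc.stop()
  }
}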
Not able to remove header from a text file while creating a DataFrame
Hi All,

I am reading a text file to create a DataFrame. While trying to exclude the header from the text file, I am not able to do it. My broader concern is how to know which options are available when reading from a source: I checked the API, and the arguments to the option method are just strings. Could anyone suggest how I can find all possible options?

Code :-

scala> val df1 = hc.read.format("text").option("header","false").load("/development/data/swain/people.txt")
df1: org.apache.spark.sql.DataFrame = [value: string]

scala> df1.show()
+--------------------+
|               value|
+--------------------+
|name|age|address|...|
|      N1|27|A1|1201|
|      N2|28|A2|1202|
|      N3|29|A3|1203|
|      N4|30|A4|1204|

Thanks,
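A note for readers of the archive: option keys are interpreted by each data source, so the valid set is documented per source rather than in the generic DataFrameReader API; the built-in "text" source in Spark 1.x has no header option and always yields a single value column, whereas the spark-csv package (already a dependency in the POM from the earlier message) does understand "header". A minimal sketch under those assumptions, where hc is the HiveContext and sc the SparkContext from the question, and the pipe delimiter is inferred from the sample rows:

// 1) Parse with spark-csv (com.databricks:spark-csv): the source itself
//    understands "header" and uses the first line as column names.
val people = hc.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // consume line 1 as the schema
  .option("delimiter", "|")        // inferred from the sample rows
  .load("/development/data/swain/people.txt")

// 2) Or drop the first line at the RDD level before building a DataFrame:
val lines = sc.textFile("/development/data/swain/people.txt")
val header = lines.first()               // "name|age|address|..."
val rows = lines.filter(_ != header)     // everything except the header line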