I am running the same code with the same libraries, but I am not getting the same output.

scala> case class flight (DEST_COUNTRY_NAME: String,
     |   ORIGIN_COUNTRY_NAME: String,
     |   count: BigInt)
defined class flight
scala> val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
flightDf: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> val flights = flightDf.as[flight]
flights: org.apache.spark.sql.Dataset[flight] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").map(flight_row => flight_row).take(3)
*res0: Array[flight] = Array(flight(United States,Romania,1), flight(United States,Ireland,264), flight(United States,India,69))*

------------------------------------------------------------------------------------------------------------------------------
Output when running the same code from IntelliJ:

20/03/26 19:09:00 INFO SparkContext: Running Spark version 3.0.0-preview2
20/03/26 19:09:00 INFO ResourceUtils: ==============================================================
20/03/26 19:09:00 INFO ResourceUtils: Resources for spark.driver:
20/03/26 19:09:00 INFO ResourceUtils: ==============================================================
20/03/26 19:09:00 INFO SparkContext: Submitted application: chapter2
20/03/26 19:09:00 INFO SecurityManager: Changing view acls to: kub19
20/03/26 19:09:00 INFO SecurityManager: Changing modify acls to: kub19
20/03/26 19:09:00 INFO SecurityManager: Changing view acls groups to:
20/03/26 19:09:00 INFO SecurityManager: Changing modify acls groups to:
20/03/26 19:09:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kub19); groups with view permissions: Set(); users with modify permissions: Set(kub19); groups with modify permissions: Set()
20/03/26 19:09:00 INFO Utils: Successfully started service 'sparkDriver' on port 46817.
20/03/26 19:09:00 INFO SparkEnv: Registering MapOutputTracker
20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMaster
20/03/26 19:09:00 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/26 19:09:00 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
20/03/26 19:09:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-0baf5097-2595-4542-99e2-a192d7baf37c
20/03/26 19:09:00 INFO MemoryStore: MemoryStore started with capacity 127.2 MiB
20/03/26 19:09:01 INFO SparkEnv: Registering OutputCommitCoordinator
20/03/26 19:09:01 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/03/26 19:09:01 INFO Utils: Successfully started service 'SparkUI' on port 4041.
20/03/26 19:09:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://localhost:4041
20/03/26 19:09:01 INFO Executor: Starting executor ID driver on host localhost
20/03/26 19:09:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38135.
20/03/26 19:09:01 INFO NettyBlockTransferService: Server created on localhost:38135
20/03/26 19:09:01 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/26 19:09:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO BlockManagerMasterEndpoint: Registering block manager localhost:38135 with 127.2 MiB RAM, BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse').
20/03/26 19:09:01 INFO SharedState: Warehouse path is 'file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse'.
20/03/26 19:09:02 INFO SparkContext: Starting job: parquet at chapter2.scala:18
20/03/26 19:09:02 INFO DAGScheduler: Got job 0 (parquet at chapter2.scala:18) with 1 output partitions
20/03/26 19:09:02 INFO DAGScheduler: Final stage: ResultStage 0 (parquet at chapter2.scala:18)
20/03/26 19:09:02 INFO DAGScheduler: Parents of final stage: List()
20/03/26 19:09:02 INFO DAGScheduler: Missing parents: List()
20/03/26 19:09:02 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at parquet at chapter2.scala:18), which has no missing parents
20/03/26 19:09:02 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 72.8 KiB, free 127.1 MiB)
20/03/26 19:09:02 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 25.9 KiB, free 127.1 MiB)
20/03/26 19:09:02 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:38135 (size: 25.9 KiB, free: 127.2 MiB)
20/03/26 19:09:02 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1206
20/03/26 19:09:02 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at parquet at chapter2.scala:18) (first 15 tasks are for partitions Vector(0))
20/03/26 19:09:02 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/03/26 19:09:02 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7560 bytes)
20/03/26 19:09:02 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/03/26 19:09:02 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1840 bytes result sent to driver
20/03/26 19:09:02 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 204 ms on localhost (executor driver) (1/1)
20/03/26 19:09:02 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/03/26 19:09:02 INFO DAGScheduler: ResultStage 0 (parquet at chapter2.scala:18) finished in 0.304 s
20/03/26 19:09:02 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
20/03/26 19:09:02 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
20/03/26 19:09:02 INFO DAGScheduler: Job 0 finished: parquet at chapter2.scala:18, took 0.332643 s
20/03/26 19:09:03 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:38135 in memory (size: 25.9 KiB, free: 127.2 MiB)
20/03/26 19:09:04 INFO V2ScanRelationPushDown: Pushing operators to parquet file:/data/flight-data/parquet/2010-summary.parquet Pushed Filters: Post-Scan Filters: Output: DEST_COUNTRY_NAME#0, ORIGIN_COUNTRY_NAME#1, count#2L
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 290.0 KiB, free 126.9 MiB)
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 24.3 KiB, free 126.9 MiB)
20/03/26 19:09:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:38135 (size: 24.3 KiB, free: 127.2 MiB)
20/03/26 19:09:04 INFO SparkContext: Created broadcast 1 from take at chapter2.scala:20
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 290.1 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 24.3 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:38135 (size: 24.3 KiB, free: 127.2 MiB)
20/03/26 19:09:04 INFO SparkContext: Created broadcast 2 from take at chapter2.scala:20
20/03/26 19:09:04 INFO CodeGenerator: Code generated in 159.155401 ms
20/03/26 19:09:04 INFO SparkContext: Starting job: take at chapter2.scala:20
20/03/26 19:09:04 INFO DAGScheduler: Got job 1 (take at chapter2.scala:20) with 1 output partitions
20/03/26 19:09:04 INFO DAGScheduler: Final stage: ResultStage 1 (take at chapter2.scala:20)
20/03/26 19:09:04 INFO DAGScheduler: Parents of final stage: List()
20/03/26 19:09:04 INFO DAGScheduler: Missing parents: List()
20/03/26 19:09:04 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at take at chapter2.scala:20), which has no missing parents
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 22.7 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 8.1 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:38135 (size: 8.1 KiB, free: 127.1 MiB)
20/03/26 19:09:04 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1206
20/03/26 19:09:04 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at take at chapter2.scala:20) (first 15 tasks are for partitions Vector(0))
20/03/26 19:09:04 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
20/03/26 19:09:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 7980 bytes)
20/03/26 19:09:04 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
20/03/26 19:09:05 INFO FilePartitionReader: Reading file path: file:///data/flight-data/parquet/2010-summary.parquet/part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet, range: 0-3921, partition values: [empty row]
20/03/26 19:09:05 INFO ZlibFactory: Successfully loaded & initialized native-zlib library
20/03/26 19:09:05 INFO CodecPool: Got brand-new decompressor [.gz]
20/03/26 19:09:05 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1762 bytes result sent to driver
20/03/26 19:09:05 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 219 ms on localhost (executor driver) (1/1)
20/03/26 19:09:05 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/03/26 19:09:05 INFO DAGScheduler: ResultStage 1 (take at chapter2.scala:20) finished in 0.235 s
20/03/26 19:09:05 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
20/03/26 19:09:05 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
20/03/26 19:09:05 INFO DAGScheduler: Job 1 finished: take at chapter2.scala:20, took 0.238010 s
20/03/26 19:09:05 INFO CodeGenerator: Code generated in 17.77886 ms
20/03/26 19:09:05 INFO SparkUI: Stopped Spark web UI at http://localhost:4041
20/03/26 19:09:05 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/03/26 19:09:05 INFO MemoryStore: MemoryStore cleared
20/03/26 19:09:05 INFO BlockManager: BlockManager stopped
20/03/26 19:09:05 INFO BlockManagerMaster: BlockManagerMaster stopped
20/03/26 19:09:05 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/03/26 19:09:05 INFO SparkContext: Successfully stopped SparkContext
20/03/26 19:09:05 INFO ShutdownHookManager: Shutdown hook called
20/03/26 19:09:05 INFO ShutdownHookManager: Deleting directory /tmp/spark-6d99677e-ae1b-4894-aa32-3a79fb0b4307

Process finished with exit code 0

----------------------------------------------------------------------------------------------------------
The code:

import org.apache.spark.sql.SparkSession

object chapter2 {

  // Define a specific data type (case class), then manipulate it using the
  // filter and map functions; this is also known as an Encoder.
  case class flight(DEST_COUNTRY_NAME: String,
                    ORIGIN_COUNTRY_NAME: String,
                    count: BigInt)

  def main(args: Array[String]): Unit = {

    // Unlike in the interactive shell, the SparkSession must be created
    // explicitly here to avoid IntelliJ errors.
    val spark = SparkSession.builder.master("local[*]").appName("chapter2").getOrCreate

    // Looks like a hard-coded system workaround.
    import spark.implicits._

    val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
    val flights = flightDf.as[flight]
    flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").map(flight_row => flight_row).take(3)

    spark.stop()
  }
}
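One thing worth noting about the difference between the two runs: the spark-shell echoes the value of every evaluated expression (that is where the `res0: Array[flight] = ...` line comes from), whereas in a compiled `main` the `Array` returned by `take(3)` is simply discarded unless it is printed explicitly. A minimal plain-Scala sketch of that behavior (the `Flight` case class and sample rows below are stand-ins for the Spark Dataset result, not the real data):

```scala
object PrintTaken {
  // Stand-in for the Dataset row type; in the real app this is the
  // `flight` case class and the array comes from flights.filter(...).take(3).
  case class Flight(dest: String, origin: String, count: BigInt)

  def main(args: Array[String]): Unit = {
    val taken: Array[Flight] = Array(
      Flight("United States", "Romania", BigInt(1)),
      Flight("United States", "Ireland", BigInt(264))
    )

    // A compiled app prints nothing for an unused expression result;
    // the REPL would have echoed it as res0. Print it explicitly:
    taken.foreach(println)
  }
}
```

So adding something like `flights.filter(...).take(3).foreach(println)` (or assigning the result and printing it) in the IntelliJ version should make the two runs show the same rows.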