I am running the same code with the same libraries but am not getting the
same output. In spark-shell the take(3) result is printed (res0 below), but
when the identical code is run as an application from IntelliJ, the job
completes with exit code 0 and the array never appears.
scala>  case class flight (DEST_COUNTRY_NAME: String,
     |                      ORIGIN_COUNTRY_NAME:String,
     |                      count: BigInt)
defined class flight

scala>     val flightDf =
spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
flightDf: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string,
ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> val flights = flightDf.as[flight]
flights: org.apache.spark.sql.Dataset[flight] = [DEST_COUNTRY_NAME: string,
ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME !=
"Canada").map(flight_row => flight_row).take(3)

res0: Array[flight] = Array(flight(United States,Romania,1), flight(United
States,Ireland,264), flight(United States,India,69))

Here is the full log when the same code runs as an application from
IntelliJ; note that the flight array is nowhere in the output:
------------------------------------------------------------------------
20/03/26 19:09:00 INFO SparkContext: Running Spark version 3.0.0-preview2
20/03/26 19:09:00 INFO ResourceUtils:
==============================================================
20/03/26 19:09:00 INFO ResourceUtils: Resources for spark.driver:

20/03/26 19:09:00 INFO ResourceUtils:
==============================================================
20/03/26 19:09:00 INFO SparkContext: Submitted application:  chapter2
20/03/26 19:09:00 INFO SecurityManager: Changing view acls to: kub19
20/03/26 19:09:00 INFO SecurityManager: Changing modify acls to: kub19
20/03/26 19:09:00 INFO SecurityManager: Changing view acls groups to:
20/03/26 19:09:00 INFO SecurityManager: Changing modify acls groups to:
20/03/26 19:09:00 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(kub19);
groups with view permissions: Set(); users  with modify permissions:
Set(kub19); groups with modify permissions: Set()
20/03/26 19:09:00 INFO Utils: Successfully started service 'sparkDriver' on
port 46817.
20/03/26 19:09:00 INFO SparkEnv: Registering MapOutputTracker
20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMaster
20/03/26 19:09:00 INFO BlockManagerMasterEndpoint: Using
org.apache.spark.storage.DefaultTopologyMapper for getting topology
information
20/03/26 19:09:00 INFO BlockManagerMasterEndpoint:
BlockManagerMasterEndpoint up
20/03/26 19:09:00 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
20/03/26 19:09:00 INFO DiskBlockManager: Created local directory at
/tmp/blockmgr-0baf5097-2595-4542-99e2-a192d7baf37c
20/03/26 19:09:00 INFO MemoryStore: MemoryStore started with capacity 127.2
MiB
20/03/26 19:09:01 INFO SparkEnv: Registering OutputCommitCoordinator
20/03/26 19:09:01 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.
20/03/26 19:09:01 INFO Utils: Successfully started service 'SparkUI' on
port 4041.
20/03/26 19:09:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://localhost:4041
20/03/26 19:09:01 INFO Executor: Starting executor ID driver on host
localhost
20/03/26 19:09:01 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 38135.
20/03/26 19:09:01 INFO NettyBlockTransferService: Server created on
localhost:38135
20/03/26 19:09:01 INFO BlockManager: Using
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication
policy
20/03/26 19:09:01 INFO BlockManagerMaster: Registering BlockManager
BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO BlockManagerMasterEndpoint: Registering block
manager localhost:38135 with 127.2 MiB RAM, BlockManagerId(driver,
localhost, 38135, None)
20/03/26 19:09:01 INFO BlockManagerMaster: Registered BlockManager
BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO BlockManager: Initialized BlockManager:
BlockManagerId(driver, localhost, 38135, None)
20/03/26 19:09:01 INFO SharedState: Setting hive.metastore.warehouse.dir
('null') to the value of spark.sql.warehouse.dir
('file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse').
20/03/26 19:09:01 INFO SharedState: Warehouse path is
'file:/home/kub19/spark-3.0.0-preview2-bin-hadoop2.7/projects/spark-warehouse'.
20/03/26 19:09:02 INFO SparkContext: Starting job: parquet at
chapter2.scala:18
20/03/26 19:09:02 INFO DAGScheduler: Got job 0 (parquet at
chapter2.scala:18) with 1 output partitions
20/03/26 19:09:02 INFO DAGScheduler: Final stage: ResultStage 0 (parquet at
chapter2.scala:18)
20/03/26 19:09:02 INFO DAGScheduler: Parents of final stage: List()
20/03/26 19:09:02 INFO DAGScheduler: Missing parents: List()
20/03/26 19:09:02 INFO DAGScheduler: Submitting ResultStage 0
(MapPartitionsRDD[1] at parquet at chapter2.scala:18), which has no missing
parents
20/03/26 19:09:02 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 72.8 KiB, free 127.1 MiB)
20/03/26 19:09:02 INFO MemoryStore: Block broadcast_0_piece0 stored as
bytes in memory (estimated size 25.9 KiB, free 127.1 MiB)
20/03/26 19:09:02 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
on localhost:38135 (size: 25.9 KiB, free: 127.2 MiB)
20/03/26 19:09:02 INFO SparkContext: Created broadcast 0 from broadcast at
DAGScheduler.scala:1206
20/03/26 19:09:02 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 0 (MapPartitionsRDD[1] at parquet at chapter2.scala:18) (first
15 tasks are for partitions Vector(0))
20/03/26 19:09:02 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/03/26 19:09:02 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7560 bytes)
20/03/26 19:09:02 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/03/26 19:09:02 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0).
1840 bytes result sent to driver
20/03/26 19:09:02 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID
0) in 204 ms on localhost (executor driver) (1/1)
20/03/26 19:09:02 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
20/03/26 19:09:02 INFO DAGScheduler: ResultStage 0 (parquet at
chapter2.scala:18) finished in 0.304 s
20/03/26 19:09:02 INFO DAGScheduler: Job 0 is finished. Cancelling
potential speculative or zombie tasks for this job
20/03/26 19:09:02 INFO TaskSchedulerImpl: Killing all running tasks in
stage 0: Stage finished
20/03/26 19:09:02 INFO DAGScheduler: Job 0 finished: parquet at
chapter2.scala:18, took 0.332643 s
20/03/26 19:09:03 INFO BlockManagerInfo: Removed broadcast_0_piece0 on
localhost:38135 in memory (size: 25.9 KiB, free: 127.2 MiB)
20/03/26 19:09:04 INFO V2ScanRelationPushDown:
Pushing operators to parquet
file:/data/flight-data/parquet/2010-summary.parquet
Pushed Filters:
Post-Scan Filters:
Output: DEST_COUNTRY_NAME#0, ORIGIN_COUNTRY_NAME#1, count#2L

20/03/26 19:09:04 INFO MemoryStore: Block broadcast_1 stored as values in
memory (estimated size 290.0 KiB, free 126.9 MiB)
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_1_piece0 stored as
bytes in memory (estimated size 24.3 KiB, free 126.9 MiB)
20/03/26 19:09:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory
on localhost:38135 (size: 24.3 KiB, free: 127.2 MiB)
20/03/26 19:09:04 INFO SparkContext: Created broadcast 1 from take at
chapter2.scala:20
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_2 stored as values in
memory (estimated size 290.1 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_2_piece0 stored as
bytes in memory (estimated size 24.3 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
on localhost:38135 (size: 24.3 KiB, free: 127.2 MiB)
20/03/26 19:09:04 INFO SparkContext: Created broadcast 2 from take at
chapter2.scala:20
20/03/26 19:09:04 INFO CodeGenerator: Code generated in 159.155401 ms
20/03/26 19:09:04 INFO SparkContext: Starting job: take at chapter2.scala:20
20/03/26 19:09:04 INFO DAGScheduler: Got job 1 (take at chapter2.scala:20)
with 1 output partitions
20/03/26 19:09:04 INFO DAGScheduler: Final stage: ResultStage 1 (take at
chapter2.scala:20)
20/03/26 19:09:04 INFO DAGScheduler: Parents of final stage: List()
20/03/26 19:09:04 INFO DAGScheduler: Missing parents: List()
20/03/26 19:09:04 INFO DAGScheduler: Submitting ResultStage 1
(MapPartitionsRDD[5] at take at chapter2.scala:20), which has no missing
parents
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_3 stored as values in
memory (estimated size 22.7 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO MemoryStore: Block broadcast_3_piece0 stored as
bytes in memory (estimated size 8.1 KiB, free 126.6 MiB)
20/03/26 19:09:04 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on localhost:38135 (size: 8.1 KiB, free: 127.1 MiB)
20/03/26 19:09:04 INFO SparkContext: Created broadcast 3 from broadcast at
DAGScheduler.scala:1206
20/03/26 19:09:04 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 1 (MapPartitionsRDD[5] at take at chapter2.scala:20) (first 15
tasks are for partitions Vector(0))
20/03/26 19:09:04 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
20/03/26 19:09:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID
1, localhost, executor driver, partition 0, PROCESS_LOCAL, 7980 bytes)
20/03/26 19:09:04 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
20/03/26 19:09:05 INFO FilePartitionReader: Reading file path:
file:///data/flight-data/parquet/2010-summary.parquet/part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet,
range: 0-3921, partition values: [empty row]
20/03/26 19:09:05 INFO ZlibFactory: Successfully loaded & initialized
native-zlib library
20/03/26 19:09:05 INFO CodecPool: Got brand-new decompressor [.gz]
20/03/26 19:09:05 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1).
1762 bytes result sent to driver
20/03/26 19:09:05 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID
1) in 219 ms on localhost (executor driver) (1/1)
20/03/26 19:09:05 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool
20/03/26 19:09:05 INFO DAGScheduler: ResultStage 1 (take at
chapter2.scala:20) finished in 0.235 s
20/03/26 19:09:05 INFO DAGScheduler: Job 1 is finished. Cancelling
potential speculative or zombie tasks for this job
20/03/26 19:09:05 INFO TaskSchedulerImpl: Killing all running tasks in
stage 1: Stage finished
20/03/26 19:09:05 INFO DAGScheduler: Job 1 finished: take at
chapter2.scala:20, took 0.238010 s
20/03/26 19:09:05 INFO CodeGenerator: Code generated in 17.77886 ms
20/03/26 19:09:05 INFO SparkUI: Stopped Spark web UI at
http://localhost:4041
20/03/26 19:09:05 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
20/03/26 19:09:05 INFO MemoryStore: MemoryStore cleared
20/03/26 19:09:05 INFO BlockManager: BlockManager stopped
20/03/26 19:09:05 INFO BlockManagerMaster: BlockManagerMaster stopped
20/03/26 19:09:05 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
20/03/26 19:09:05 INFO SparkContext: Successfully stopped SparkContext
20/03/26 19:09:05 INFO ShutdownHookManager: Shutdown hook called
20/03/26 19:09:05 INFO ShutdownHookManager: Deleting directory
/tmp/spark-6d99677e-ae1b-4894-aa32-3a79fb0b4307

Process finished with exit code 0
------------------------------------------------------------------------
And here is the application code itself:

import org.apache.spark.sql.SparkSession

object chapter2 {

  // define a case class for the record type so it can be manipulated
  // with typed filter and map functions; spark.implicits._ below derives
  // the Encoder for it
  case class flight(DEST_COUNTRY_NAME: String,
                    ORIGIN_COUNTRY_NAME: String,
                    count: BigInt)

  def main(args: Array[String]): Unit = {

    // unlike the interactive shell, which provides `spark` for you, the
    // SparkSession must be built explicitly in an application
    val spark = SparkSession.builder.master("local[*]").appName("chapter2").getOrCreate

    // bring the implicit Encoders into scope for .as[flight]
    import spark.implicits._
    val flightDf = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
    val flights = flightDf.as[flight]
    // take(3) returns an Array[flight] on the driver; nothing here prints it
    flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
      .map(flight_row => flight_row)
      .take(3)

    spark.stop()
  }
}
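
If the missing array is just down to the REPL's automatic echo (spark-shell
prints every expression's value as resN, while a compiled application
silently discards the Array that take(3) returns), then printing explicitly
should reproduce the shell output. A minimal sketch of that change, with
only the val and the foreach(println) added; the object name chapter2Print
is mine, everything else matches the code above:

import org.apache.spark.sql.SparkSession

object chapter2Print {

  case class flight(DEST_COUNTRY_NAME: String,
                    ORIGIN_COUNTRY_NAME: String,
                    count: BigInt)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("chapter2").getOrCreate
    import spark.implicits._

    val flights = spark.read
      .parquet("/data/flight-data/parquet/2010-summary.parquet/")
      .as[flight]

    // take(3) materializes three rows as an Array[flight] on the driver;
    // unlike the REPL, an application prints nothing unless asked to
    val rows = flights
      .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
      .map(flight_row => flight_row)
      .take(3)
    rows.foreach(println)

    spark.stop()
  }
}

Run as an application, this should print the same three flight(...) rows
that the shell displayed as res0.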






