Re: Re: cache table vs. parquet table performance
So I think caching large data is not a best practice.
Re: cache table vs. parquet table performance
Hi, Tomas. Thanks, your question gave me something to think about. The cache is usually best suited to smaller data sets; I think caching large data consumes too much memory or disk space. Spilling the cached data in Parquet format might be a good improvement.
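A minimal sketch of that manual "spill to Parquet" idea in Scala (the table name, column, and path are taken from this thread; this is an illustration of the workaround, not the cache's actual behavior): materialize the hot slice as Parquet once, then query the Parquet copy instead of a cached table.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Write the hot day out once as Parquet (columnar, encoded and compressed on disk) ...
spark.table("events")
  .where($"day_registered" === 20190102)
  .write.mode("overwrite")
  .parquet("/tmp/events/jan/02")

// ... then expose it as a view so existing SQL keeps working.
spark.read.parquet("/tmp/events/jan/02")
  .createOrReplaceTempView("event_jan_01_par")

The size difference reported in the original message (5.7 GB cached vs. 178 MB Parquet) presumably comes from Parquet's encodings and compression, which is also why this manual approach can be competitive with the in-memory cache.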
RE: dataset best practice question
Hi Mohit,

I'm not sure that there is a "correct" answer here, but I tend to use classes whenever the input or output data represents something meaningful (such as a domain model object). I would recommend against creating many temporary classes for each and every transformation step, as that may become difficult to maintain over time. withColumn statements will continue to work, and you don't need to cast to your output class until you have set up all transformations. Therefore, you can do things like:

case class A(f1, f2, f3)
case class B(f1, f2, f3, f4, f5, f6)

ds_a = spark.read.csv("path").as[A]
ds_b = ds_a
  .withColumn("f4", someUdf)
  .withColumn("f5", someUdf)
  .withColumn("f6", someUdf)
  .as[B]

Kevin
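To make the pattern above concrete, here is a self-contained sketch that compiles as written; the field types, the input path, and the stand-in UDF are my own assumptions for illustration, not anything from Mohit's actual schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

case class A(f1: String, f2: String, f3: String)
case class B(f1: String, f2: String, f3: String, f4: Int, f5: Int, f6: Int)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical transformation standing in for someUdf.
    val someUdf = udf((s: String) => s.length)

    // Assumes a CSV with a header row containing columns f1, f2, f3.
    val dsA = spark.read.option("header", "true").csv("path/to/input.csv").as[A]

    // Stay untyped while adding columns, then cast back to a case class at the end.
    val dsB = dsA
      .withColumn("f4", someUdf($"f1"))
      .withColumn("f5", someUdf($"f2"))
      .withColumn("f6", someUdf($"f3"))
      .as[B]

    dsB.show()
    spark.stop()
  }
}

The only typed boundaries are the read (.as[A]) and the final cast (.as[B]); everything in between is ordinary DataFrame code, so no intermediate case classes are needed.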
Re: [ANNOUNCE] Announcing Apache Spark 2.2.3
Glad to hear this.
Re: [ANNOUNCE] Announcing Apache Spark 2.2.3
Congrats, great work Dongjoon.

On Tue, Jan 15, 2019 at 3:47 PM, Dongjoon Hyun wrote:
> We are happy to announce the availability of Spark 2.2.3!
>
> Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2 maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade to this stable release.
>
> To download Spark 2.2.3, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-2-3.html
>
> We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you.
>
> Bests,
> Dongjoon.

--
Best Regards
Jeff Zhang
Re: How to force-quit a Spark application?
You should check the active threads in your app. Since your pool uses non-daemon threads, that will prevent the app from exiting. spark.stop() should have stopped the Spark jobs in other threads, at least. But if something is blocking one of those threads, or if something is creating a non-daemon thread that stays alive somewhere, you'll see that. Or you can force quit with sys.exit.

--
Marcelo
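One way to act on the point about non-daemon threads, sketched in Scala against the snippet in the original message below (the pool size and names are taken from that snippet; treat this as an illustration to try, not a guaranteed fix):

import java.util.concurrent.{Executors, ThreadFactory}
import scala.concurrent.ExecutionContext

// Mark the pool's threads as daemon threads so they cannot keep the JVM
// alive after main() returns and spark.stop() has been called.
val daemonFactory = new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r)
    t.setDaemon(true)
    t
  }
}
val pool = Executors.newFixedThreadPool(3, daemonFactory)
implicit val xc = ExecutionContext.fromExecutorService(pool)

// ... run the Futures and Await.result(...) as before ...

// If the application still hangs, calling sys.exit(0) after spark.stop()
// forces the JVM to terminate regardless of leftover non-daemon threads.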
How to force-quit a Spark application?
I submitted a Spark job through the ./spark-submit command; the code executed successfully, however the application got stuck when trying to quit Spark.

My code snippet:
'''
{
  val spark = SparkSession.builder.master(...).getOrCreate

  val pool = Executors.newFixedThreadPool(3)
  implicit val xc = ExecutionContext.fromExecutorService(pool)
  val taskList = List(train1, train2, train3) // where train* is a Future function which wraps up some data reading, feature engineering and machine learning steps
  val results = Await.result(Future.sequence(taskList), 20 minutes)

  println("Shutting down pool and executor service")
  pool.shutdown()
  xc.shutdown()

  println("Exiting spark")
  spark.stop()
}
'''

After I submitted the job, from the terminal I could see the code executing and printing "Exiting spark", however after printing that line it never exited Spark, it just got stuck.

Does anybody know what the reason is? Or how to force quitting?

Thanks!
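A small sketch of one way to make the shutdown more deterministic, reusing the names from the snippet above (whether this resolves the hang depends on which thread is actually still alive, so treat it as something to test rather than a definitive fix):

import java.util.concurrent.TimeUnit

println("Shutting down pool and executor service")
pool.shutdown()
// Wait for in-flight tasks to finish instead of returning immediately.
pool.awaitTermination(1, TimeUnit.MINUTES)
xc.shutdown()

println("Exiting spark")
spark.stop()

// Force the JVM to exit even if a stray non-daemon thread is still running.
sys.exit(0)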
dataset best practice question
Fellow Spark Coders,

I am trying to move from using Dataframes to Datasets for a reasonably large code base. Today the code looks like this:

df_a = read_csv
df_b = df_a.withColumn(some_transform_that_adds_more_columns)
// repeat the above several times

With datasets, this will require defining

case class A { f1, f2, f3 } // fields from csv file
case class B { f1, f2, f3, f4 } // union of A and the new field added by some_transform_that_adds_more_columns
// repeat this 10 times

Is there a better way?

Mohit.
cache table vs. parquet table performance
Hello,

I'm using the spark-thrift server and I'm searching for the best performing solution to query a hot set of data. I'm processing records with nested structure, containing subtypes and arrays. One record takes up several KB.

I tried to make some improvement with cache table:

cache table event_jan_01 as select * from events where day_registered = 20190102;

If I understood correctly, the data should be stored in in-memory columnar format with storage level MEMORY_AND_DISK, so data which doesn't fit in memory will be spilled to disk (I assume also in columnar format?). I cached 1 day of data (1 M records) and according to the Spark UI storage tab none of the data was cached to memory and everything was spilled to disk. The size of the data was 5.7 GB. Typical queries took ~20 sec.

Then I tried to store the data in parquet format:

CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as select * from event_jan_01;

The whole parquet took up only 178 MB, and typical queries took 5-10 sec.

Is it possible to tune Spark to spill the cached data in parquet format? Why was the whole cached table spilled to disk with nothing staying in memory?

Spark version: 2.4.0

Best regards,
Tomas
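For experimenting with where the cached data ends up, the Dataset API lets you pick the storage level explicitly instead of relying on the CACHE TABLE default. A minimal sketch (the table and column names are the ones from the message above; whether a serialized level helps in this case is an assumption to verify, not a guarantee):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val hot = spark.table("events").where($"day_registered" === 20190102)

// Cache with an explicit storage level; MEMORY_AND_DISK_SER keeps the
// blocks serialized, trading CPU for a smaller in-memory footprint.
hot.persist(StorageLevel.MEMORY_AND_DISK_SER)
hot.createOrReplaceTempView("event_jan_01")

// Materialize the cache, then check the Storage tab to see how much
// actually stayed in memory versus spilled to disk.
hot.count()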
DFS Pregel performance vs simple Java DFS implementation
Hi,

Considering a directed graph with 15,000 vertices and 14,000 edges, I wonder why GraphX (Pregel) takes much more time than a plain Java graph implementation to collect all the vertices reachable from a given vertex down to the leaves. By the nature of the graph, we can almost consider it a tree.

The Java implementation: a few seconds
The GraphX implementation: several minutes

Is Pregel really suitable for this kind of processing?

Additional information:

My system: 16 cores, 35 GB RAM

The Pregel algorithm:

val sourceId: VertexId = 42 // The ultimate source
// Initialize the graph such that all vertices except the root have canReach = false.
val initialGraph: Graph[Boolean, Double] = graph.mapVertices((id, _) => id == sourceId)
val dfs = initialGraph.pregel(false)(
  (id, canReach, newCanReach) => canReach || newCanReach, // Vertex Program
  triplet => { // Send Message
    if (triplet.srcAttr && !triplet.dstAttr) {
      Iterator((triplet.dstId, true))
    } else {
      Iterator.empty
    }
  },
  (a, b) => a || b // Merge Message
)

Thanks
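For completeness, a small sketch of how the result of the run above can be read back (names follow the snippet; collecting to the driver assumes the reachable set is small enough to fit there, which holds for a 15,000-vertex graph):

// Vertices whose attribute ended up true are reachable from sourceId.
val reachable = dfs.vertices
  .filter { case (_, canReach) => canReach }
  .map { case (id, _) => id }
  .collect()

println(s"${reachable.length} vertices reachable from $sourceId")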
SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms
Hi all,

I want to re-send the previous SPIP on introducing a DataFrame-based graph component to collect more feedback. It supports property graphs, Cypher graph queries, and graph algorithms built on top of the DataFrame API. If you are a GraphX user or your workload is essentially graph queries, please help review and check how it fits into your use cases. Your feedback would be greatly appreciated!

# Links to SPIP and design sketch:

* Jira issue for the SPIP: https://issues.apache.org/jira/browse/SPARK-25994
* Google Doc: https://docs.google.com/document/d/1ljqVsAh2wxTZS8XqwDQgRT6i_mania3ffYSYpEgLx9k/edit?usp=sharing
* Jira issue for a first design sketch: https://issues.apache.org/jira/browse/SPARK-26028
* Google Doc: https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit?usp=sharing

# Sample code:

~~~
val graph = ...

// query
val result = graph.cypher("""
  MATCH (p:Person)-[r:STUDY_AT]->(u:University)
  RETURN p.name, r.since, u.name
""")

// algorithms
val ranks = graph.pageRank.run()
~~~

Best,
Xiangrui
SparkSql query on a port and process queries
Hi,

In my problem, data is stored both in a database and on HDFS. I have created an application where, depending on the query, Spark loads the data, processes the query, and returns the answer. I'm looking for a service that receives SQL queries and returns the answers (like a database's command line). Is there a way for my application to listen on a port, receive a query, and return the answer there?
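One existing building block that matches this description is Spark's Thrift JDBC/ODBC server, which can be started from inside your own application so that it listens on a port (10000 by default) and serves SQL over JDBC. A minimal sketch, assuming the spark-hive-thriftserver dependency is on the classpath; the paths and view names are placeholders, not anything from the original setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder
  .appName("sql-endpoint")
  .enableHiveSupport()
  .getOrCreate()

// Register whatever data the incoming queries should be able to reach,
// e.g. a table loaded from HDFS (a JDBC-backed table could be added the same way).
spark.read.parquet("hdfs:///data/events").createOrReplaceTempView("events")

// Start the JDBC/ODBC endpoint on this application's session; clients
// (beeline, JDBC drivers, BI tools) can now connect and run SQL against it.
HiveThriftServer2.startWithContext(spark.sqlContext)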