Re: Re: cache table vs. parquet table performance

2019-01-15 Thread 大啊
So I think caching large data is not a best practice. At 2019-01-16 12:24:34, "大啊" wrote: Hi, Tomas. Thanks for your question; it gave me something to think about. But the cache works best with smaller data. I think caching large data will consume too much memory or disk space. Spill the cached data in

Re:cache table vs. parquet table performance

2019-01-15 Thread 大啊
Hi, Tomas. Thanks for your question; it gave me something to think about. But the cache works best with smaller data. I think caching large data will consume too much memory or disk space. Spilling the cached data in parquet format may be a good improvement. At 2019-01-16 02:20:56, "Tomas Bartalos" wrote:

RE: dataset best practice question

2019-01-15 Thread kevin.r.mellott
Hi Mohit, I’m not sure that there is a “correct” answer here, but I tend to use classes whenever the input or output data represents something meaningful (such as a domain model object). I would recommend against creating many temporary classes for each and every transformation step, as that
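Kevin's advice — one meaningful domain class reused across steps, rather than a throwaway class per transformation — can be sketched without Spark using plain Java streams. The `Order` type, its fields, and the transformation here are hypothetical, purely for illustration:

```java
import java.util.List;
import java.util.stream.Collectors;

public class DomainModelDemo {
    // One meaningful domain class, reused by every pipeline step...
    record Order(String id, double amount, boolean priority) {}

    // ...instead of inventing a new temporary class per transformation.
    static List<Order> markPriority(List<Order> in, double threshold) {
        return in.stream()
                 .map(o -> new Order(o.id(), o.amount(), o.amount() > threshold))
                 .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Order> out = markPriority(List.of(new Order("a", 150.0, false)), 100.0);
        System.out.println(out.get(0).priority()); // prints true
    }
}
```

In Spark terms, the same idea is a single case class (or bean) used as the element type of each `Dataset` stage, with steps adding or updating fields rather than defining a fresh class for every intermediate shape.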

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jiaan Geng
Glad to hear this. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
Congrats, great work Dongjoon. Dongjoon Hyun wrote on Tue, Jan 15, 2019 at 3:47 PM: > We are happy to announce the availability of Spark 2.2.3! > > Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2 > maintenance branch of Spark. We strongly recommend that all 2.2.x users > upgrade to this

Re: How to force-quit a Spark application?

2019-01-15 Thread Marcelo Vanzin
You should check the active threads in your app. Since your pool uses non-daemon threads, that will prevent the app from exiting. spark.stop() should have stopped the Spark jobs in other threads, at least. But if something is blocking one of those threads, or if something is creating a non-daemon
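Marcelo's point can be demonstrated outside Spark: a default fixed thread pool uses non-daemon threads, so the JVM waits on them and the application hangs unless the pool is shut down; a pool built from daemon threads does not block exit. A minimal sketch (pool size and task are illustrative, not from the original thread):

```java
import java.util.concurrent.*;

public class DaemonPoolDemo {
    public static void main(String[] args) throws Exception {
        // Default Executors pools use non-daemon threads: the JVM waits for
        // them, so forgetting pool.shutdown() leaves the application hanging
        // after spark.stop() returns. A daemon ThreadFactory avoids that:
        ExecutorService daemonPool = Executors.newFixedThreadPool(3, r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // daemon threads do not block JVM exit
            return t;
        });

        Future<Boolean> f = daemonPool.submit(() -> Thread.currentThread().isDaemon());
        System.out.println("worker is daemon: " + f.get()); // prints true

        // With a non-daemon pool this shutdown is mandatory before exit:
        daemonPool.shutdown();
        daemonPool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Either approach (daemon threads, or an explicit `shutdown()` after the work completes) lets the JVM terminate normally without resorting to `System.exit`.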

How to force-quit a Spark application?

2019-01-15 Thread Pola Yao
I submitted a Spark job through the ./spark-submit command; the code executed successfully, but the application got stuck when trying to quit Spark. My code snippet: ''' { val spark = SparkSession.builder.master(...).getOrCreate val pool = Executors.newFixedThreadPool(3) implicit val xc =

dataset best practice question

2019-01-15 Thread Mohit Jaggi
Fellow Spark coders, I am trying to move from using DataFrames to Datasets for a reasonably large code base. Today the code looks like this: df_a = read_csv df_b = df.withColumn ( some_transform_that_adds_more_columns ) //repeat the above several times With Datasets, this will require defining

cache table vs. parquet table performance

2019-01-15 Thread Tomas Bartalos
Hello, I'm using the Spark Thrift server and I'm searching for the best-performing way to query a hot set of data. I'm processing records with a nested structure containing subtypes and arrays; one record takes up several KB. I tried to make some improvement with cache table: cache table event_jan_01
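For reference, the two approaches being compared in this thread look roughly like this in Spark SQL (the `event_jan_01` table name comes from the message; the parquet table name is illustrative):

```sql
-- keep the hot set in Spark's in-memory columnar cache
CACHE TABLE event_jan_01;

-- versus persisting the same hot set as a parquet table and querying that
CREATE TABLE event_jan_01_parquet USING parquet AS
SELECT * FROM event_jan_01;
```

The in-memory cache must fit in executor storage memory (or spill), while a parquet table trades memory pressure for columnar scans from disk — which is why the thread leans toward parquet for large hot sets.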

DFS Pregel performance vs simple Java DFS implementation

2019-01-15 Thread daveb
Hi, Considering a directed graph with 15,000 vertices and 14,000 edges, I wonder why GraphX (Pregel) takes much more time than a plain Java graph implementation to get all the vertices from a given vertex down to the leaves. By the nature of the graph, we can almost consider it a tree. The Java
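The "simple Java DFS" baseline daveb mentions is presumably something like the following iterative traversal over an adjacency list (names and structure are a guess, not from the post). On a 15,000-vertex, near-tree graph this finishes in milliseconds on one machine, which is the crux of the comparison with Pregel's per-superstep distributed message-passing overhead:

```java
import java.util.*;

public class SimpleDfs {
    // Iterative depth-first traversal of a directed graph
    // stored as adjacency lists, starting from one vertex.
    static List<Integer> dfs(Map<Integer, List<Integer>> adj, int start) {
        List<Integer> order = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            int v = stack.pop();
            if (!seen.add(v)) continue;          // skip already-visited vertices
            order.add(v);
            for (int w : adj.getOrDefault(v, List.of())) {
                if (!seen.contains(w)) stack.push(w);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adj = Map.of(
            0, List.of(1, 2),
            1, List.of(3));
        System.out.println(dfs(adj, 0)); // visits all four reachable vertices
    }
}
```

Pregel, by contrast, pays scheduling and shuffle costs per superstep regardless of graph size, so for a graph this small the single-machine traversal is expected to win.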

SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-15 Thread Xiangrui Meng
Hi all, I want to re-send the previous SPIP on introducing a DataFrame-based graph component to collect more feedback. It supports property graphs, Cypher graph queries, and graph algorithms built on top of the DataFrame API. If you are a GraphX user or your workload is essentially graph queries,

SparkSql query on a port and process queries

2019-01-15 Thread Soheil Pourbafrani
Hi, In my problem, data is stored both in a database and on HDFS. I created an application where, depending on the query, Spark loads the data, processes the query, and returns the answer. I'm looking for a service that accepts SQL queries and returns the answers (like a database command line). Is there a way that my
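What Soheil describes — a long-running service that accepts SQL over a port — is essentially what the Spark Thrift JDBC/ODBC server provides (it ships with the Spark distribution, and is the same server Tomas uses in the cache-table thread above). A minimal sketch; the master URL and port are illustrative defaults:

```shell
# start the Thrift JDBC/ODBC server from the Spark install directory
./sbin/start-thriftserver.sh --master local[*]

# connect with the bundled beeline CLI and run SQL interactively
./bin/beeline -u jdbc:hive2://localhost:10000
```

Any JDBC client can connect the same way, so tables registered in the session (whether backed by HDFS files or by a JDBC source) become queryable over the port.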