Re:Re:cache table vs. parquet table performance

2019-01-15 Thread 大啊
So I think caching large data is not a best practice.
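
For example, a rough sketch of that idea, with placeholder table and path names: write the hot day out as Parquet once and let queries hit the Parquet copy instead of the SQL cache.

// Assumption: "events" and the path below are only placeholders for illustration.
val hotDay = spark.table("events").where("day_registered = 20190102")

// Persist the hot subset once as compressed, columnar Parquet on disk ...
hotDay.write.mode("overwrite").parquet("/tmp/events/hot/20190102")

// ... and expose it to SQL / the thrift server under a view name.
spark.read.parquet("/tmp/events/hot/20190102").createOrReplaceTempView("event_jan_01")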


At 2019-01-16 12:24:34, "大啊"  wrote:

Hi ,Tomas.
Thanks for your question, it gave me something to think about. But caching usually 
works best for smaller data sets.
I think caching large data will consume too much memory or disk space.
Spilling the cached data in Parquet format might be a good improvement.


At 2019-01-16 02:20:56, "Tomas Bartalos"  wrote:

Hello,


I'm using the spark-thrift server and I'm searching for the best performing solution to 
query a hot set of data. I'm processing records with a nested structure, containing 
subtypes and arrays. One record takes up several KB.


I tried to make some improvement with cache table:

cache table event_jan_01 as select * from events where day_registered = 20190102;




If I understood correctly, the data should be stored in in-memory columnar 
format with storage level MEMORY_AND_DISK. So data which doesn't fit into memory 
will be spilled to disk (I assume also in columnar format?).
I cached 1 day of data (1 M records) and according to the Spark UI storage tab none 
of the data was cached in memory and everything was spilled to disk. The size 
of the data was 5.7 GB.
Typical queries took ~ 20 sec.


Then I tried to store the data to parquet format:

CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as
select * from event_jan_01;




The whole parquet took up only 178MB.
And typical queries took 5-10 sec.


Is it possible to tune Spark to spill the cached data in Parquet format?
Why was the whole cached table spilled to disk while nothing stayed in memory?


Spark version: 2.4.0


Best regards,
Tomas






 

Re:cache table vs. parquet table performance

2019-01-15 Thread 大啊
Hi ,Tomas.
Thanks for your question, it gave me something to think about. But caching usually 
works best for smaller data sets.
I think caching large data will consume too much memory or disk space.
Spilling the cached data in Parquet format might be a good improvement.


At 2019-01-16 02:20:56, "Tomas Bartalos"  wrote:

Hello,


I'm using the spark-thrift server and I'm searching for the best performing solution to 
query a hot set of data. I'm processing records with a nested structure, containing 
subtypes and arrays. One record takes up several KB.


I tried to make some improvement with cache table:

cache table event_jan_01 as select * from events where day_registered = 20190102;




If I understood correctly, the data should be stored in in-memory columnar 
format with storage level MEMORY_AND_DISK. So data which doesn't fit into memory 
will be spilled to disk (I assume also in columnar format?).
I cached 1 day of data (1 M records) and according to the Spark UI storage tab none 
of the data was cached in memory and everything was spilled to disk. The size 
of the data was 5.7 GB.
Typical queries took ~ 20 sec.


Then I tried to store the data to parquet format:

CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as
select * from event_jan_01;




The whole parquet took up only 178MB.
And typical queries took 5-10 sec.


Is it possible to tune Spark to spill the cached data in Parquet format?
Why was the whole cached table spilled to disk while nothing stayed in memory?


Spark version: 2.4.0


Best regards,
Tomas



RE: dataset best practice question

2019-01-15 Thread kevin.r.mellott
Hi Mohit,

 

I’m not sure that there is a “correct” answer here, but I tend to use classes 
whenever the input or output data represents something meaningful (such as a 
domain model object). I would recommend against creating many temporary classes 
for each and every transformation step as that may be difficult to maintain 
over time.

 

Using withColumn statements will continue to work, and you don’t need to cast 
to your output class until you’ve set up all transformations. Therefore, you can 
do things like:

 

case class A(f1, f2, f3)

case class B(f1, f2, f3, f4, f5, f6)

val ds_a = spark.read.csv("path").as[A]

val ds_b = ds_a

  .withColumn("f4", someUdf)

  .withColumn("f5", someUdf)

  .withColumn("f6", someUdf)

  .as[B]
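
In full, a self-contained version of that sketch could look like the following (the column names, types, and the length-based UDF are only assumptions for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

case class A(f1: String, f2: String, f3: String)
case class B(f1: String, f2: String, f3: String, f4: Int, f5: Int, f6: Int)

val spark = SparkSession.builder().appName("dataset-pattern").getOrCreate()
import spark.implicits._

val someUdf = udf((s: String) => s.length)  // placeholder transformation

// Read the CSV as A, derive the extra columns, and only cast to B at the end.
// Assumes the CSV header provides columns named f1, f2, f3.
val ds_a = spark.read.option("header", "true").csv("path/to/input.csv").as[A]
val ds_b = ds_a
  .withColumn("f4", someUdf(col("f1")))
  .withColumn("f5", someUdf(col("f2")))
  .withColumn("f6", someUdf(col("f3")))
  .as[B]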

 

Kevin

 

From: Mohit Jaggi  
Sent: Tuesday, January 15, 2019 1:31 PM
To: user 
Subject: dataset best practice question

 

Fellow Spark Coders,

I am trying to move from using Dataframes to Datasets for a reasonably large 
code base. Today the code looks like this:

 

df_a = read_csv

df_b = df_a.withColumn ( some_transform_that_adds_more_columns )

//repeat the above several times

 

With datasets, this will require defining

 

case class A { f1, f2, f3 } //fields from csv file

case class B { f1, f2, f3, f4 } //union of A and new field added by 
some_transform_that_adds_more_columns

//repeat this 10 times

 

Is there a better way? 

 

Mohit.



Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jiaan Geng
Glad to hear this.






Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
Congrats, Great work Dongjoon.



Dongjoon Hyun  于2019年1月15日周二 下午3:47写道:

> We are happy to announce the availability of Spark 2.2.3!
>
> Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
> maintenance branch of Spark. We strongly recommend all 2.2.x users to
> upgrade to this stable release.
>
> To download Spark 2.2.3, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-2-3.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> Bests,
> Dongjoon.
>


-- 
Best Regards

Jeff Zhang


Re: How to force-quit a Spark application?

2019-01-15 Thread Marcelo Vanzin
You should check the active threads in your app. Since your pool uses
non-daemon threads, that will prevent the app from exiting.
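
For example, something like this prints every live non-daemon thread right before the point where the app should exit (a generic JVM sketch, nothing Spark-specific):

import scala.collection.JavaConverters._

// Any thread listed here can keep the JVM alive after spark.stop() returns.
Thread.getAllStackTraces.keySet.asScala
  .filter(t => t.isAlive && !t.isDaemon)
  .foreach(t => println(s"non-daemon thread still alive: ${t.getName} (state=${t.getState})"))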

spark.stop() should have stopped the Spark jobs in other threads, at
least. But if something is blocking one of those threads, or if
something is creating a non-daemon thread that stays alive somewhere,
you'll see that.

Or you can force quit with sys.exit.
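
And if the pool turns out to be the culprit, one option (a sketch, assuming the fixed pool of 3 from your snippet) is to build it from daemon threads, which cannot block JVM exit, and to wait for it before stopping Spark:

import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

// Daemon threads do not keep the JVM alive once the main method returns.
val daemonFactory = new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r)
    t.setDaemon(true)
    t
  }
}
val pool = Executors.newFixedThreadPool(3, daemonFactory)

// ... submit the training futures as before ...

pool.shutdown()
pool.awaitTermination(60, TimeUnit.SECONDS)  // give in-flight tasks a chance to finish
spark.stop()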

On Tue, Jan 15, 2019 at 1:30 PM Pola Yao  wrote:
>
> I submitted a Spark job through the ./spark-submit command; the code was executed 
> successfully, but the application got stuck when trying to quit Spark.
>
> My code snippet:
> '''
> {
>
> val spark = SparkSession.builder.master(...).getOrCreate
>
> val pool = Executors.newFixedThreadPool(3)
> implicit val xc = ExecutionContext.fromExecutorService(pool)
> val taskList = List(train1, train2, train3)  // where train* is a Future 
> function which wrapped up some data reading and feature engineering and 
> machine learning steps
> val results = Await.result(Future.sequence(taskList), 20 minutes)
>
> println("Shutting down pool and executor service")
> pool.shutdown()
> xc.shutdown()
>
> println("Exiting spark")
> spark.stop()
>
> }
> '''
>
> After I submitted the job, from the terminal I could see the code was executed 
> and printed "Exiting spark"; however, after printing that line, it never 
> exited Spark, it just got stuck.
>
> Does anybody know what the reason is? Or how to force it to quit?
>
> Thanks!
>
>


-- 
Marcelo




How to force-quit a Spark application?

2019-01-15 Thread Pola Yao
I submitted a Spark job through the ./spark-submit command; the code was
executed successfully, but the application got stuck when trying to
quit Spark.

My code snippet:
'''
{

val spark = SparkSession.builder.master(...).getOrCreate

val pool = Executors.newFixedThreadPool(3)
implicit val xc = ExecutionContext.fromExecutorService(pool)
val taskList = List(train1, train2, train3)  // where train* is a Future
function which wrapped up some data reading and feature engineering and
machine learning steps
val results = Await.result(Future.sequence(taskList), 20 minutes)

println("Shutting down pool and executor service")
pool.shutdown()
xc.shutdown()

println("Exiting spark")
spark.stop()

}
'''

After I submitted the job, from the terminal I could see the code was executed
and printed "Exiting spark"; however, after printing that line, it never
exited Spark, it just got stuck.

Does anybody know what the reason is? Or how to force it to quit?

Thanks!


dataset best practice question

2019-01-15 Thread Mohit Jaggi
Fellow Spark Coders,
I am trying to move from using Dataframes to Datasets for a reasonably
large code base. Today the code looks like this:

df_a = read_csv
df_b = df_a.withColumn ( some_transform_that_adds_more_columns )
//repeat the above several times

With datasets, this will require defining

case class A { f1, f2, f3 } //fields from csv file
case class B { f1, f2, f3, f4 } //union of A and new field added by
some_transform_that_adds_more_columns
//repeat this 10 times

Is there a better way?

Mohit.


cache table vs. parquet table performance

2019-01-15 Thread Tomas Bartalos
Hello,

I'm using the spark-thrift server and I'm searching for the best performing
solution to query a hot set of data. I'm processing records with a nested
structure, containing subtypes and arrays. One record takes up several KB.

I tried to make some improvement with cache table:

cache table event_jan_01 as select * from events where day_registered =
20190102;


If I understood correctly, the data should be stored in *in-memory columnar*
format with storage level MEMORY_AND_DISK. So data which doesn't fit into
memory will be spilled to disk (I assume also in columnar format?).
I cached 1 day of data (1 M records) and according to the Spark UI storage tab
none of the data was cached in memory and everything was spilled to disk.
The size of the data was *5.7 GB.*
Typical queries took ~ 20 sec.
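
For reference, the same hot subset can also be cached through the DataFrame API, where the storage level is explicit (a sketch, assuming the same events table and an active SparkSession called spark):

import org.apache.spark.storage.StorageLevel

// Cache the hot day with an explicitly chosen storage level instead of CACHE TABLE's default.
val hot = spark.table("events").where("day_registered = 20190102")
hot.persist(StorageLevel.MEMORY_AND_DISK)  // or e.g. MEMORY_ONLY / DISK_ONLY, depending on what fits
hot.createOrReplaceTempView("event_jan_01")
hot.count()  // materialize the cache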

Then I tried to store the data to parquet format:

CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as
select * from event_jan_01;


The whole parquet took up only *178MB.*
And typical queries took 5-10 sec.

Is it possible to tune Spark to spill the cached data in Parquet format?
Why was the whole cached table spilled to disk while nothing stayed in memory?

Spark version: 2.4.0

Best regards,
Tomas


DFS Pregel performance vs simple Java DFS implementation

2019-01-15 Thread daveb
Hi,

Considering a directed graph with 15,000 vertices and 14,000 edges, I wonder
why GraphX (Pregel) takes much more time than a plain Java graph implementation
to get all the vertices reachable from a given vertex down to the leaves.
By the nature of the graph, we can almost consider it a tree.

The Java implementation: a few seconds
The GraphX implementation: several minutes
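
For reference, the single-machine implementation is essentially a plain DFS of this shape (a sketch over an assumed in-memory adjacency list, not the exact Java code):

import scala.collection.mutable

// Return every vertex reachable from source in a directed graph given as an adjacency list.
def reachable(adj: Map[Long, Seq[Long]], source: Long): Set[Long] = {
  val seen = mutable.Set.empty[Long]
  var frontier = List(source)
  while (frontier.nonEmpty) {
    val v = frontier.head
    frontier = frontier.tail
    if (seen.add(v)) frontier = adj.getOrElse(v, Nil).toList ++ frontier
  }
  seen.toSet
}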

Is Pregel really suitable for this kind of processing?

Additional information:
My system: 16 cores, 35 GB RAM

The Pregel algorithm:

val sourceId: VertexId = 42 // The ultimate source
// Initialize the graph such that all vertices except the root have canReach = false.
val initialGraph: Graph[Boolean, Double] = graph.mapVertices((id, _) => id == sourceId)
val dfs = initialGraph.pregel(false)(
  (id, canReach, newCanReach) => canReach || newCanReach, // Vertex Program
  triplet => { // Send Message
    if (triplet.srcAttr && !triplet.dstAttr) {
      Iterator((triplet.dstId, true))
    } else {
      Iterator.empty
    }
  },
  (a, b) => a || b // Merge Message
)

Thanks 






SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-15 Thread Xiangrui Meng
Hi all,

I want to re-send the previous SPIP on introducing a DataFrame-based graph
component to collect more feedback. It supports property graphs, Cypher
graph queries, and graph algorithms built on top of the DataFrame API. If
you are a GraphX user or your workload is essentially graph queries, please
help review and check how it fits into your use cases. Your feedback would
be greatly appreciated!

# Links to SPIP and design sketch:

* Jira issue for the SPIP: https://issues.apache.org/jira/browse/SPARK-25994
* Google Doc:
https://docs.google.com/document/d/1ljqVsAh2wxTZS8XqwDQgRT6i_mania3ffYSYpEgLx9k/edit?usp=sharing
* Jira issue for a first design sketch:
https://issues.apache.org/jira/browse/SPARK-26028
* Google Doc:
https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit?usp=sharing

# Sample code:

~~~
val graph = ...

// query
val result = graph.cypher("""
  MATCH (p:Person)-[r:STUDY_AT]->(u:University)
  RETURN p.name, r.since, u.name
""")

// algorithms
val ranks = graph.pageRank.run()
~~~

Best,
Xiangrui


SparkSql query on a port and process queries

2019-01-15 Thread Soheil Pourbafrani
Hi,
In my problem, data is stored both in a database and on HDFS. I created an
application in which, according to the query, Spark loads the data, processes the
query, and returns the answer.

I'm looking for a service that receives SQL queries and returns the answers
(like a database command line). Is there a way for my application to listen on
a port, receive the query, and return the answer there?
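
For example, something along these lines might be what I am after: Spark's JDBC/ODBC (Thrift) endpoint started inside the application (a rough sketch; it assumes the spark-hive-thriftserver module is on the classpath, and all names and paths below are only placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder()
  .appName("sql-endpoint")
  .enableHiveSupport()
  .getOrCreate()

// Expose the data sets that should be queryable over the port as views.
spark.read.parquet("/data/on/hdfs").createOrReplaceTempView("hdfs_table")
// Database-backed data could be registered the same way via spark.read.jdbc(...).

// Start the JDBC/ODBC endpoint inside this session; clients connect on the port
// configured by hive.server2.thrift.port (commonly 10000).
HiveThriftServer2.startWithContext(spark.sqlContext)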