Re: Pitfalls of partitioning by host?

2018-08-27 Thread Michael Artz
Well, if we think of shuffling as a necessity to perform an operation, then the problem would be that you are adding an aggregation stage to a job that is going to get shuffled anyway. For example, if you need to join two datasets, then Spark will still shuffle the data, whether they are grouped by
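For illustration, a minimal sketch (hypothetical DataFrames, both already repartitioned by a "host" column): joining on a different key still inserts Exchange (shuffle) operators, because the join key, not the prior grouping, drives the partitioning.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data: two DataFrames already repartitioned by "host".
    df_a = spark.range(100).withColumn("host", F.lit("node1")).repartition("host")
    df_b = spark.range(100).withColumn("host", F.lit("node2")).repartition("host")

    # Joining on "id" (not "host") still shuffles both sides; look for
    # "Exchange hashpartitioning(id, ...)" in the physical plan.
    df_a.join(df_b, on="id").explain()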

Pitfalls of partitioning by host?

2018-08-27 Thread Patrick McCarthy
When debugging some behavior on my YARN cluster I wrote the following PySpark UDF to figure out what host was operating on what row of data:

    @F.udf(T.StringType())
    def add_hostname(x):
        import socket
        return str(socket.gethostname())

It occurred to me that I could use this to enforce
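A minimal usage sketch (the input DataFrame `df` is hypothetical): tagging each row with the executor's hostname and counting rows per host to see how the data is distributed.

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)  # hypothetical input

    @F.udf(T.StringType())
    def add_hostname(x):
        import socket
        return str(socket.gethostname())

    # Evaluate the UDF on the executors and count rows per host.
    df.withColumn("host", add_hostname(F.col("id"))).groupBy("host").count().show()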

Re: How to use 'insert overwrite [local] directory' correctly?

2018-08-27 Thread Xiao Li
Open a JIRA? Bang Xiao wrote on Mon, 2018-08-27 at 2:46 AM: > Solved the problem by creating the directory on HDFS before executing the SQL, > but I hit a new error when I use: > > INSERT OVERWRITE LOCAL DIRECTORY '/search/odin/test' ROW FORMAT DELIMITED > FIELDS TERMINATED BY '\t' SELECT vrid, query, url,

Re: Re: About the question of Spark Structured Streaming window output

2018-08-27 Thread z...@zjdex.com
Hi Gerard: Thank you very much! About the second question: what I wonder is that when the batch 1 data (2018-08-27 11:04:00, 1) arrives, the max event time is "2018-08-27 11:04:00", which is larger than the window + watermark of the batch 0 data (2018-08-27 09:53:00, 1), so it should trigger the output
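For reference, a minimal sketch of the kind of query under discussion (source, column names, and durations are assumptions): a windowed count with a watermark, where in append mode a window is emitted only once the watermark (max event time minus the delay) passes the window's end.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical socket source producing lines like "2018-08-27 11:04:00,1".
    lines = spark.readStream.format("socket") \
        .option("host", "localhost").option("port", 9999).load()

    events = lines.select(
        F.split("value", ",").getItem(0).cast("timestamp").alias("ts"),
        F.split("value", ",").getItem(1).cast("int").alias("num"))

    counts = (events
              .withWatermark("ts", "10 minutes")
              .groupBy(F.window("ts", "10 minutes"))
              .count())

    # Append mode: a window's result is output only after the watermark
    # moves past the window end.
    query = counts.writeStream.outputMode("append").format("console").start()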

Re: How to deal with context dependent computing?

2018-08-27 Thread devjyoti patra
Hi Junfeng, You should be able to do this with the window functions lead or lag (see the sketch below): https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions.html#lead Thanks, Dev On Mon, Aug 27, 2018 at 7:08 AM JF Chen wrote: > Thanks Sonal. > For example, I have data as follows: >
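A minimal sketch of lead/lag (the DataFrame and schema are hypothetical): each row is paired with the previous and next row's value within its partition, which is the usual way to make a computation depend on neighbouring rows.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("a", 1, 10), ("a", 2, 30), ("a", 3, 25)],
        ["key", "seq", "value"])  # hypothetical schema

    w = Window.partitionBy("key").orderBy("seq")

    # lag: value from the previous row; lead: value from the next row.
    df.select("key", "seq", "value",
              F.lag("value").over(w).alias("prev_value"),
              F.lead("value").over(w).alias("next_value")).show()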

Re: How to use 'insert overwrite [local] directory' correctly?

2018-08-27 Thread Bang Xiao
Solved the problem by creating the directory on HDFS before executing the SQL, but I hit a new error when I use: INSERT OVERWRITE LOCAL DIRECTORY '/search/odin/test' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT vrid, query, url, loc_city FROM custom.common_wap_vr WHERE logdate >= '2018073000'

Re: Re: About the question of Spark Structured Streaming window output

2018-08-27 Thread Gerard Maas
Hi, > 1. Why is the start time "2018-08-27 09:50:00" and not "2018-08-27 09:53:00"? When I define the window, the startTime is not set. When no 'startTime' is defined, windows are aligned to the start of the upper time magnitude. So, if your window is defined in minutes, it will be aligned to the
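A minimal sketch of that alignment (assuming a timestamp column `ts`): without startTime, a 10-minute window aligns to :00, :10, :20, ..., so 09:53:00 lands in [09:50:00, 10:00:00); an explicit startTime shifts the alignment.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = (spark.createDataFrame([("2018-08-27 09:53:00",)], ["ts"])
          .select(F.col("ts").cast("timestamp").alias("ts")))

    # Default alignment: 09:53:00 falls in [09:50:00, 10:00:00).
    df.select(F.window("ts", "10 minutes")).show(truncate=False)

    # startTime="3 minutes" shifts windows to :03, :13, :23, ...,
    # so 09:53:00 now falls in [09:53:00, 10:03:00).
    df.select(F.window("ts", "10 minutes", "10 minutes", "3 minutes")).show(truncate=False)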

Re: How to use 'insert overwrite [local] directory' correctly?

2018-08-27 Thread Bang Xiao
Spark requires the directory to be created first, while Hive creates the directory automatically.

How to use 'insert overwrite [local] directory' correctly?

2018-08-27 Thread Bang Xiao
Spark 2.3.0 supports INSERT OVERWRITE DIRECTORY to write data from a query directly into the filesystem. I have run into a problem with the SQL "INSERT OVERWRITE DIRECTORY '/tmp/test-insert-spark' SELECT vrid, query, url, loc_city FROM custom.common_wap_vr WHERE logdate >= '2018073000' AND logdate <=
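For context, a minimal sketch of running that statement through spark.sql (the session setup is assumed; the WHERE clause is abbreviated as in the post):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Spark 2.3.0+: write the query result directly to the filesystem.
    spark.sql("""
        INSERT OVERWRITE DIRECTORY '/tmp/test-insert-spark'
        SELECT vrid, query, url, loc_city
        FROM custom.common_wap_vr
        WHERE logdate >= '2018073000'
    """)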

Re: Spark Structured Streaming using S3 as data source

2018-08-27 Thread Sherif Hamdy
Thanks for the reply; to make sure I got this right: if I have 5 JSON files with 100 records in each file, and, for example, Spark failed while processing the tenth record in the 3rd file, then when the query runs again it will begin processing from the tenth record in the 3rd file. Did I get that
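For reference, a minimal sketch of a file-source stream reading JSON from S3 with a checkpoint location (bucket paths and schema are hypothetical). Note that the file source's checkpoint records which input files went into each micro-batch, so on restart an incomplete batch is re-run from its beginning rather than from an offset inside a file.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StringType

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical schema and bucket paths.
    schema = StructType().add("id", StringType()).add("payload", StringType())

    stream = spark.readStream.schema(schema).json("s3a://my-bucket/input/")

    query = (stream.writeStream
             .format("parquet")
             .option("path", "s3a://my-bucket/output/")
             .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
             .start())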