Well, if we think of shuffling as a necessity to perform an operation, then the
problem is that you are adding an aggregation stage to a job that is
going to get shuffled anyway. For example, if you need to join two datasets,
Spark will still shuffle the data, whether they are grouped by
When debugging some behavior on my YARN cluster I wrote the following
PySpark UDF to figure out which host was operating on which row of data:

from pyspark.sql import functions as F, types as T

@F.udf(T.StringType())
def add_hostname(x):
    # Report the hostname of the executor that processed this row.
    import socket
    return str(socket.gethostname())
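Outside Spark, the body of that UDF is plain Python, so it can be sanity-checked without a cluster; it ignores its argument and just returns the local machine's hostname:

```python
import socket

def add_hostname(x):
    # Same body as the UDF above, minus the Spark decorator.
    # The argument is ignored; the result depends only on which
    # machine (executor) runs the function.
    return str(socket.gethostname())

host = add_hostname(None)
print(host)
```

Inside a job you would apply it to any column just to force a per-row call on each executor.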
It occurred to me that I could use this to enforce
Open a JIRA?
Bang Xiao wrote on Mon, Aug 27, 2018 at 2:46 AM:
> Solved the problem by creating the directory on HDFS before executing the
> SQL, but I hit a new error when I use:
>
> INSERT OVERWRITE LOCAL DIRECTORY '/search/odin/test' row format delimited
> FIELDS TERMINATED BY '\t' select vrid, query, url,
Hi Gerard:
Thank you very much!
About the second question, what I wonder is this: when the batch 1
data (2018-08-27 11:04:00, 1) arrives, the max event time is "2018-08-27
11:04:00", which is later than the window + watermark of the batch 0 data
(2018-08-27 09:53:00, 1), so it can trigger the output
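The arithmetic here can be checked without a cluster. A sketch in plain Python, assuming a 10-minute tumbling window and a 10-minute watermark delay (neither size appears in this excerpt, so both are assumptions):

```python
from datetime import datetime, timedelta

# Hypothetical sizes -- the actual window/watermark used in this thread
# are not shown in the excerpt.
WINDOW = timedelta(minutes=10)
DELAY = timedelta(minutes=10)

def window_end(event_time, window=WINDOW):
    # Tumbling windows are aligned by flooring the event time to a
    # multiple of the window length, then adding one window.
    epoch = datetime(1970, 1, 1)
    start = epoch + ((event_time - epoch) // window) * window
    return start + window

batch0 = datetime(2018, 8, 27, 9, 53, 0)   # batch 0 event time
batch1 = datetime(2018, 8, 27, 11, 4, 0)   # batch 1 event time (new max)

watermark = batch1 - DELAY   # 10:54:00
end0 = window_end(batch0)    # batch 0's window closes at 10:00:00

# The watermark has passed batch 0's window end, so that window
# becomes eligible for output.
print(watermark > end0)      # True
```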
Hi Junfeng,
You should be able to do this with the window aggregation functions lead or
lag:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions.html#lead
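As a rough illustration of the semantics (plain Python, no Spark session): over an ordered partition, lag pairs each row with the value from the previous row and lead with the value from the next row, defaulting to None at the edges:

```python
def lag(values, offset=1, default=None):
    # lag: value from `offset` rows before the current row.
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=None):
    # lead: value from `offset` rows after the current row.
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

ordered = [10, 20, 30, 40]   # one partition, already ordered
print(lag(ordered))          # [None, 10, 20, 30]
print(lead(ordered))         # [20, 30, 40, None]
```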
Thanks,
Dev
On Mon, Aug 27, 2018 at 7:08 AM JF Chen wrote:
> Thanks Sonal.
> For example, I have data as following:
>
Solved the problem by creating the directory on HDFS before executing the SQL,
but I hit a new error when I use:
INSERT OVERWRITE LOCAL DIRECTORY '/search/odin/test' row format delimited
FIELDS TERMINATED BY '\t' select vrid, query, url, loc_city from
custom.common_wap_vr where logdate >= '2018073000'
Hi,
> 1. Why is the start time "2018-08-27 09:50:00" and not "2018-08-27 09:53:00"?
When I define the window, the start time is not set.
When no 'starttime' is defined, windows are aligned to the start of the
next larger time unit. So, if your window is defined in minutes, it will be
aligned to the
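The alignment can be reproduced in plain Python. A sketch assuming a 10-minute tumbling window (the window size is an assumption, since it is not shown in this excerpt): flooring the event time to a multiple of the window length puts 09:53:00 into the window that starts at 09:50:00.

```python
from datetime import datetime, timedelta

def aligned_window_start(event_time, window):
    # With no explicit start time, windows are aligned by flooring the
    # event time to a multiple of the window length.
    epoch = datetime(1970, 1, 1)
    return epoch + ((event_time - epoch) // window) * window

event = datetime(2018, 8, 27, 9, 53, 0)
start = aligned_window_start(event, timedelta(minutes=10))
print(start)   # 2018-08-27 09:50:00 -- not 09:53:00
```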
Spark needs the directory to be created first, while Hive can create the
directory automatically.
Spark 2.3.0 supports INSERT OVERWRITE DIRECTORY to write data from a query
directly into the filesystem.
I have met a problem with the SQL
"INSERT OVERWRITE DIRECTORY '/tmp/test-insert-spark' select vrid, query,
url, loc_city from custom.common_wap_vr where logdate >= '2018073000' and
logdate <=
Thanks for the reply; to make sure I got this right:
if I have 5 JSON files with 100 records in each file,
and, for example, Spark fails while processing the tenth record in the 3rd
file,
then when the query runs again it will begin processing from the tenth record
in the 3rd file. Did I get that