Re: writing a small csv to HDFS is super slow

2019-03-25 Thread Lian Jiang
Thanks, guys, for the replies. The execution plan shows a giant query. After splitting it up (divide and conquer), saving is quick.

On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama wrote:
> Hi Lian,
> Since you are using repartition(1), do you want to decrease the number of
> partitions? If so, have you tried to use coalesce

streaming - absolute maximum

2019-03-25 Thread Jason Nerothin
Hello,

I wish to calculate the most recent event time from a Stream. Something like this:

val timestamped = records.withColumn("ts_long", unix_timestamp($"eventTime"))
val lastReport = timestamped
  .withWatermark("eventTime", "4 hours")
  .groupBy(col("eventTime"),

Re: Window function range between

2019-03-25 Thread Mich Talebzadeh
Hi,

This works for me:

scala> val wSpec3 = Window.partitionBy('priceInfo.getItem("ticker")).orderBy('priceInfo.getItem("timeissued").desc).rangeBetween(0,5)
wSpec3: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@f308c09

HTH. Is the date format correct?

Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
I’m beginning to agree with you and find it rather surprising that this is not mentioned explicitly anywhere (maybe I missed it?). It is possible to serialize code to be executed in executors to various nodes. It also seems possible to serialize the “driver” bits of code although I’m not sure how the

Re: Where does the Driver run?

2019-03-25 Thread Andrew Melo
Hi Pat, Indeed, I don't think that it's possible to use cluster mode w/o spark-submit. All the docs I see appear to always describe needing to use spark-submit for cluster mode -- it's not even compatible with spark-shell. But it makes sense to me -- if you want Spark to run your application's

Window function range between

2019-03-25 Thread Kumar sp
Hi,

I am trying to use a range-between window function, but I keep getting the error below:

main" org.apache.spark.sql.AnalysisException: Window Frame specifiedwindowframe(RangeFrame, currentrow$(), 5) must match the required frame specified

I need to check the next consecutive 5-second interval

Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
In the GUI, while the job is running, the app-id link brings up logs for both executors. The “name” link goes to port 4040 of the machine that launched the job, but that host is not resolvable right now, so the page is not shown. I’ll try netstat, but the use of port 4040 was a good clue. By what you say below