Re: How to authenticate to ADLS from within spark job on the fly

2017-08-18 Thread ayan guha
It may not be as easy as you think. The REST call will happen in the driver, but the reads will be in the executors. On Sat, 19 Aug 2017 at 11:42 am, Imtiaz Ahmed wrote: > Hi All, > > I am building a Spark library which developers will use when writing their > Spark jobs to
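
For illustration, a minimal sketch of the split the reply describes, assuming ADLS Gen1 with the OAuth settings from hadoop-azure-datalake (the dfs.adls.oauth2.* keys) and a hypothetical fetchCredentialsFromRestApi helper standing in for the REST call:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("adls-auth-sketch").getOrCreate()

    // Hypothetical helper: the REST call for credentials runs once, on the driver.
    case class AdlsCreds(clientId: String, clientSecret: String, refreshUrl: String)
    def fetchCredentialsFromRestApi(dataset: String): AdlsCreds =
      AdlsCreds("client-id-from-rest", "secret-from-rest",
        "https://login.microsoftonline.com/tenant-id/oauth2/token")

    // Put the credentials into the Hadoop configuration; Spark ships this
    // configuration to the executors, which is where the actual reads happen.
    val creds = fetchCredentialsFromRestApi("some-dataset")
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
    hadoopConf.set("dfs.adls.oauth2.client.id", creds.clientId)
    hadoopConf.set("dfs.adls.oauth2.credential", creds.clientSecret)
    hadoopConf.set("dfs.adls.oauth2.refresh.url", creds.refreshUrl)

    // The read is planned on the driver but executed by the executors.
    val df = spark.read.parquet("adl://myaccount.azuredatalakestore.net/path/to/data")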

Fwd: PageRank - 4x slower than Spark?!

2017-08-18 Thread kant kodali
-- Forwarded message -- From: Kaepke, Marc Date: Fri, Aug 18, 2017 at 10:51 AM Subject: PageRank - 4x slower than Spark?! To: "u...@flink.apache.org" Hi everyone, I compared Flink and Spark by using PageRank. I guessed Flink

How to authenticate to ADLS from within spark job on the fly

2017-08-18 Thread Imtiaz Ahmed
Hi All, I am building a Spark library which developers will use when writing their Spark jobs to get access to data on Azure Data Lake, but the authentication will depend on the dataset they ask for. I need to call a REST API from within the Spark job to get credentials and authenticate to read data

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Burak Yavuz
Hi Jacek, The way the memory sink is architected at the moment is that it either appends a row (append/update mode) or replaces all rows (complete mode). When a user specifies a checkpoint location, the guarantee Structured Streaming provides is that output sinks will not lose data and will be
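
A minimal sketch of the sink behaviour described here, using the built-in rate source and a hypothetical bucketed count; in complete mode each trigger replaces the full contents of the in-memory table, whereas append/update mode appends rows:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("memory-sink-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical unbounded input: the rate source emits (timestamp, value) rows.
    val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    // Complete mode with an aggregation: every trigger replaces all rows of the
    // in-memory table "counts" with the current aggregate state.
    val counts = events.groupBy(($"value" % 10).as("bucket")).count()
    val query = counts.writeStream
      .format("memory")
      .queryName("counts")
      .outputMode("complete")
      .start()

    // After a trigger or two, the sink is queryable as a temporary table.
    spark.sql("SELECT * FROM counts ORDER BY bucket").show()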

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Jacek Laskowski
Hi, This is what I could find in Spark's source code about the `recoverFromCheckpointLocation` flag (which led me to explore the complete output mode for the dropDuplicates operator). The `recoverFromCheckpointLocation` flag is enabled by default and varies per sink (memory, console and others). *

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
My assumption is it would be similar though: an in-memory sink of all of your records would quickly overwhelm your cluster, but with an aggregation it could be reasonable. But there might be additional reasons on top of that. On Fri, Aug 18, 2017 at 11:44 AM Holden Karau wrote: >

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
Ah yes, I'm not sure about the workings of the memory sink. On Fri, Aug 18, 2017 at 11:36 AM, Jacek Laskowski wrote: > Hi Holden, > > Thanks a lot for a bit more light on the topic. That however does not > explain why memory sink requires Complete for a checkpoint location to >

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Jacek Laskowski
Hi Holden, Thanks a lot for a bit more light on the topic. That, however, does not explain why the memory sink requires Complete for a checkpoint location to work. The only reason I used Complete output mode was to meet the requirements of the memory sink, and that got me thinking why would the

Failing jobs with Spark 2.2 running on Yarn with HDFS

2017-08-18 Thread Jan-Hendrik Zab
Hello! I have some weird problems with Spark running on top of YARN (Spark 2.2 on Cloudera CDH 5.12). There are a lot of "java.net.SocketException: Network is unreachable" errors in the executors; part of a log file is at http://support.l3s.de/~zab/spark-errors.txt, and the jobs also fail at rather random

Spark Web UI SSL Encryption

2017-08-18 Thread Anshuman Kumar
Hello, I have recently installed Spark 2.2.0 and am trying to use it for some big data processing. Spark is installed on a server that I access from a remote computer. I need to set up SSL encryption for the Spark web UI, but following some threads online I’m still not able to set it up. Can
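
Not a full answer, but a sketch of the kind of settings involved, assuming a JKS keystore has already been generated for the server; the spark.ssl.* properties shown are the standard global SSL settings (they can also be scoped per component, e.g. spark.ssl.ui.*, or placed in spark-defaults.conf rather than set in code):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Sketch only: paths and passwords are placeholders.
    val conf = new SparkConf()
      .set("spark.ssl.enabled", "true")
      .set("spark.ssl.keyStore", "/path/to/keystore.jks")
      .set("spark.ssl.keyStorePassword", "keystore-password")
      .set("spark.ssl.keyPassword", "key-password")
      .set("spark.ssl.protocol", "TLSv1.2")

    // The web UI started by this session should then be served over HTTPS.
    val spark = SparkSession.builder().config(conf).appName("ssl-ui-sketch").getOrCreate()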

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
So performing complete output without an aggregation would require building up a table of the entire input to write out at each micro batch. This would get prohibitively expensive quickly. With an aggregation we just need to keep track of the aggregates and update them every batch, so the memory

Re: GC overhead exceeded

2017-08-18 Thread KhajaAsmath Mohammed
Hi Pat, I am using dynamic scheduling with an executor memory of 8 GB. I will try static scheduling by specifying the number of executors and cores. Thanks, Asmath Sent from my iPhone > On Aug 18, 2017, at 10:39 AM, Patrick Alwell wrote: > > +1 what is the executor

Re: GC overhead exceeded

2017-08-18 Thread Patrick Alwell
+1 what is the executor memory? You may need to adjust executor memory and cores. For the sake of simplicity, each executor can handle 5 concurrent tasks and should have 5 cores. So if your cluster has 100 cores, you’d have 20 executors. And if your cluster memory is 500 GB, each executor would
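
A sketch of that arithmetic applied to the hypothetical 100-core / 500 GB cluster from the reply, using standard Spark configuration keys (the exact memory value is an assumption that leaves headroom for overhead):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // 100 cores  / 5 cores per executor = 20 executors
    // 500 GB RAM / 20 executors         = ~25 GB per executor
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "false") // static allocation, as discussed
      .set("spark.executor.instances", "20")
      .set("spark.executor.cores", "5")
      .set("spark.executor.memory", "23g")             // headroom for YARN memory overhead

    val spark = SparkSession.builder().config(conf).appName("sizing-sketch").getOrCreate()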

Persist performance in Spark

2017-08-18 Thread KhajaAsmath Mohammed
Hi, I am using persist before inserting DataFrame data back into Hive. This step is adding 8 minutes to my total execution time. Is there a way to reduce the total time without running into out-of-memory issues? Here is my code. val datapoint_df: Dataset[Row] =
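
A minimal sketch of one option, with hypothetical table names standing in for the truncated code: persist only if the DataFrame is actually reused, and consider a serialized, spill-to-disk storage level to keep memory pressure down.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("persist-sketch").enableHiveSupport().getOrCreate()

    // Hypothetical stand-in for the thread's datapoint_df.
    val datapoint_df = spark.table("source_db.datapoints")

    // A serialized, spill-to-disk level keeps memory pressure and GC lower than the default.
    datapoint_df.persist(StorageLevel.MEMORY_AND_DISK_SER)

    datapoint_df.write.mode("append").insertInto("target_db.datapoints")
    datapoint_df.unpersist()

    // If datapoint_df is written once and never reused, dropping persist() entirely
    // removes the extra materialization pass.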

[SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Jacek Laskowski
Hi, Why is there a requirement for a streaming aggregation in a streaming query? What would happen if Spark allowed Complete without a single aggregation? This is the latest master. scala> val q = ids. | writeStream. | format("memory"). | queryName("dups"). |
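
A hedged reconstruction of the kind of query the truncated snippet starts, with the rate source standing in for the elided `ids` definition; without an aggregation, starting it in Complete mode is rejected:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("complete-mode-sketch").getOrCreate()

    // Hypothetical stand-in for `ids`: a streaming Dataset with no aggregation.
    val ids = spark.readStream.format("rate").option("rowsPerSecond", "1").load().select("value")

    // start() fails with an AnalysisException along the lines of
    // "Complete output mode not supported when there are no streaming aggregations
    //  on streaming DataFrames/Datasets".
    val q = ids.writeStream
      .format("memory")
      .queryName("dups")
      .outputMode("complete")
      .start()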

Re: GC overhead exceeded

2017-08-18 Thread KhajaAsmath Mohammed
It is just a SQL query on a Hive table with a transformation of adding 10 more columns calculated for currency. The input size for this query is 2 months, which is around 450 GB of data. I added persist but it didn't help. Also, the executor memory is 8 GB. Any suggestions please? Sent from my iPhone > On

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-18 Thread Jörn Franke
You have forgotten a y: It must be MM/dd/yyyy. > On 17. Aug 2017, at 21:30, Aakash Basu wrote: > > Hi Palwell, > > Tried doing that, but it's becoming null for all the dates after the > transformation with functions. > > df2 =
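
A quick sketch of the fix (in Scala rather than the thread's PySpark, with hypothetical column names and data):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.unix_timestamp

    val spark = SparkSession.builder().appName("date-format-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical data in the MM/dd/yyyy layout discussed in the thread.
    val df = Seq("08/17/2017", "12/01/2016").toDF("date_str")

    // unix_timestamp returns null when the pattern does not match the data, which is
    // the symptom described above; with the full MM/dd/yyyy pattern the parse succeeds.
    val df2 = df.withColumn("parsed",
      unix_timestamp($"date_str", "MM/dd/yyyy").cast("timestamp"))
    df2.show(false)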