Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi all, I am reading data from HDFS in the form of parquet files (around 3 GB) and running an algorithm from the Spark ML library. If I create the same Spark dataframe by reading the data from S3, the same algorithm takes considerably more time. I don't understand why this is happening. Is this a ch
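Since the question is about S3 reads being slow, one likely-relevant knob set is the Hadoop s3a connector configuration. The sketch below lists a couple of commonly tuned s3a options as they would appear in spark-defaults.conf; the values are illustrative assumptions, not measured recommendations.

```properties
# Illustrative s3a tuning; values are assumptions, not benchmarks.
# Columnar formats like Parquet do many small ranged reads, so
# random fadvise often helps versus the sequential default.
spark.hadoop.fs.s3a.experimental.input.fadvise  random
# Allow more concurrent connections to S3.
spark.hadoop.fs.s3a.connection.maximum          100
```

Unlike HDFS, S3 offers no data locality and every read is a remote HTTP request, so some gap versus a co-located HDFS cluster is expected even after tuning; how well the file is partitioned also matters.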

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
> > You can also use s3 url but this requires a special bucket configuration, > a dedicated empty bucket and it lacks some interoperability with other AWS > services. > > Nevertheless, it could also be something else in the code. Can you post > an example reproducing the

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
ance, consider how well partitioned your HDFS file is vs the S3 > file. > > On Wed, May 27, 2020 at 1:51 PM Dark Crusader < > relinquisheddra...@gmail.com> wrote: > >> Hi Jörn, >> >> Thanks for the reply. I will try to create an easier example to reproduce >&g

Re: Spark dataframe hdfs vs s3

2020-05-30 Thread Dark Crusader
is less network IO and data redistribution on the nodes. Thanks for your help. Aditya On Sat, 30 May, 2020, 10:48 am Jörn Franke wrote: > Maybe some AWS network-optimized instances with higher bandwidth will > improve the situation. > > On 27.05.2020 at 19:51, Dark Crusader wrote: >

Re: Add python library with native code

2020-06-05 Thread Dark Crusader
Hi Stone, Have you looked into this article? https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987 I haven't tried it with .so files; however, I did use the approach he recommends to install my other dependencies. I hope it helps. On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong w
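For completeness, the general shape of shipping a packaged Python environment with a job can be sketched as below; all names and paths here are hypothetical, and this is not the exact procedure from the linked article.

```
# Hypothetical names/paths; not the article's exact steps.
# Pack the environment (built on an OS/arch matching the executors,
# which matters especially for native .so extensions):
venv-pack -o env.tar.gz

# Ship it and point the workers' Python at the unpacked copy:
PYSPARK_PYTHON=./env/bin/python \
spark-submit --archives env.tar.gz#env my_job.py
```

One caveat for .so files specifically: Python generally cannot load native extension modules from inside a zip, so --py-files with a zipped dependency folder often fails for them, whereas the --archives route unpacks to disk on each executor and avoids that problem.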

Mocking pyspark read writes

2020-07-07 Thread Dark Crusader
Hi everyone, I have a function which reads and writes a parquet file from HDFS. When I'm writing a unit test for this function, I want to mock this read & write. How do you achieve this? Any help would be appreciated. Thank you.

Mock spark reads and writes

2020-07-14 Thread Dark Crusader
Sorry I wasn't very clear in my last email. I have a function like this:

    def main(read_file):
        df = spark.read.csv(read_file)
        # ** Some other code **
        df.write.csv(path)

Which I need to write a unit test for. Would Python's unittest.mock help me here? When I googled this, I most
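Yes, unittest.mock can cover this. A minimal sketch is below, assuming the function is refactored to take the SparkSession as a parameter (the function body and paths are hypothetical, based on the snippet in the email); the test then substitutes a MagicMock for the session, so no cluster and no real HDFS I/O are needed.

```python
from unittest import mock

# Hypothetical version of the function from the email, with the
# SparkSession passed in so a test can substitute a mock for it.
def main(spark, read_file, out_path):
    df = spark.read.csv(read_file)
    # ** Some other code ** would transform df here
    df.write.csv(out_path)
    return df

# A MagicMock stands in for the real SparkSession; attribute access
# (spark.read.csv, df.write.csv) is recorded automatically.
fake_spark = mock.MagicMock()
fake_df = fake_spark.read.csv.return_value

result = main(fake_spark, "hdfs:///in.csv", "hdfs:///out.csv")

# Verify the read and the write were invoked with the expected paths.
fake_spark.read.csv.assert_called_once_with("hdfs:///in.csv")
fake_df.write.csv.assert_called_once_with("hdfs:///out.csv")
assert result is fake_df
```

If the function instead uses a module-level session (say, `my_job.spark`), the same idea works with `mock.patch("my_job.spark")` inside the test; the module name here is hypothetical.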

Spark streaming receivers

2020-08-08 Thread Dark Crusader
Hi, I'm having some trouble figuring out how receivers tie into the Spark driver-executor structure. Do all executors have a receiver that is blocked as soon as it receives some stream data? Or can multiple streams of data be taken as input into a single executor? I have stream data coming in at ever

Re: Spark streaming receivers

2020-08-09 Thread Dark Crusader
00ms / 100(ms / partition / receiver) * 5 receivers). If I have a total > of 10 cores in the system, 5 of them are running receivers; the remaining 5 > must process the 50 partitions of data generated by the last second of work. > > And again, just to reiterate, if you are doing
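The capacity arithmetic in the quoted reply can be worked through explicitly. The 1000 ms batch interval below is an assumption inferred from "the last second of work"; the 100 ms figure is the block interval from the quote (each receiver produces one partition per block interval).

```python
# Worked version of the arithmetic in the quoted reply.
# batch_interval_ms = 1000 is an assumption inferred from the quote.
batch_interval_ms = 1000
block_interval_ms = 100   # one partition per receiver per block interval
num_receivers = 5
total_cores = 10

# Each receiver emits one partition per block interval, so per batch:
partitions_per_batch = (batch_interval_ms // block_interval_ms) * num_receivers
print(partitions_per_batch)  # 50

# Each receiver pins one core for the lifetime of the stream, so only
# the remainder is available to process those partitions.
processing_cores = total_cores - num_receivers
print(processing_cores)      # 5
```

This is why receiver-based streaming jobs need more cores than receivers: with 5 receivers on 10 cores, 50 partitions per batch are crunched by only 5 cores.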