Hi all,
I am reading data from HDFS in the form of Parquet files (around 3 GB) and
running an algorithm from the Spark ML library.
If I create the same Spark DataFrame by reading the data from S3, the same
algorithm takes considerably more time.
I don't understand why this is happening. Is this a ch
>
> You can also use an s3 URL, but this requires a special bucket
> configuration, a dedicated empty bucket, and it lacks some interoperability
> with other AWS services.
>
> Nevertheless, it could also be something else with the code. Can you post
> an example reproducing the issue?
>
> For instance, consider how well partitioned your HDFS file is vs the S3
> file.
>
> On Wed, May 27, 2020 at 1:51 PM Dark Crusader <
> relinquisheddra...@gmail.com> wrote:
>
>> Hi Jörn,
>>
>> Thanks for the reply. I will try to create an easier example to reproduce
is less network IO and data redistribution on the
nodes.
Thanks for your help.
Aditya
On Sat, 30 May, 2020, 10:48 am Jörn Franke wrote:
> Maybe some aws network optimized instances with higher bandwidth will
> improve the situation.
>
> On 27.05.2020 at 19:51, Dark Crusader wrote:
Hi Stone,
Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
I haven't tried it with .so files; however, I did use the approach he
recommends to install my other dependencies.
I hope it helps.
On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong w
Hi everyone,
I have a function that reads and writes a Parquet file on HDFS. When I'm
writing a unit test for this function, I want to mock the read & write.
How do you achieve this?
Any help would be appreciated. Thank you.
Sorry I wasn't very clear in my last email.
I have a function like this:
def main(read_file):
    df = spark.read.csv(read_file)
    # ... some other code ...
    df.write.csv(path)
Which I need to write a unit test for.
Would Python's unittest.mock help me here?
When I googled this, I most
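For what it's worth, yes — unittest.mock can stand in for the SparkSession so the test never touches HDFS. A minimal sketch (here `spark` and `path` are passed in as parameters purely for illustration; in your real code they may be module-level or injected differently):

```python
from unittest import mock

# Hypothetical version of the function under test: same shape as the one
# above, but with its collaborators passed in so they can be mocked.
def main(spark, read_file, path):
    df = spark.read.csv(read_file)
    # ... some other code ...
    df.write.csv(path)

# Replace the SparkSession with a MagicMock; attribute access and calls
# are recorded instead of doing real I/O.
fake_spark = mock.MagicMock()
main(fake_spark, "in.csv", "out.csv")

# Verify the read and write happened with the expected arguments.
fake_spark.read.csv.assert_called_once_with("in.csv")
fake_df = fake_spark.read.csv.return_value
fake_df.write.csv.assert_called_once_with("out.csv")
```

If `spark` is a module-level object rather than a parameter, `mock.patch("yourmodule.spark")` achieves the same thing without changing the function's signature.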
Hi,
I'm having some trouble figuring out how receivers tie into the Spark
driver-executor structure.
Do all executors have a receiver that is blocked as soon as it
receives some stream data?
Or can multiple streams of data be taken as input into a single executor?
I have stream data coming in at ever
1000ms / 100(ms / partition / receiver) * 5 receivers). If I have a total
> of 10 cores in the system, 5 of them are running receivers; the remaining 5
> must process the 50 partitions of data generated by the last second of work.
>
> And again, just to reiterate, if you are doing
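Spelling out the arithmetic quoted above (the 1-second batch interval and 100 ms block interval are illustrative numbers taken from that message, not necessarily your settings — the block interval is controlled by spark.streaming.blockInterval, which defaults to 200 ms):

```python
# Each receiver cuts one block per block interval, and each block
# becomes one partition of the batch's RDD.
batch_interval_ms = 1000   # 1-second batches (assumed)
block_interval_ms = 100    # spark.streaming.blockInterval (assumed)
num_receivers = 5

partitions_per_batch = (batch_interval_ms // block_interval_ms) * num_receivers
print(partitions_per_batch)  # 50 partitions per batch

# With 10 cores total, each receiver pins one core for the lifetime of
# the stream, so only the remainder is left for processing.
total_cores = 10
processing_cores = total_cores - num_receivers
print(processing_cores)  # 5 cores to chew through 50 partitions/second
```

This is why a receiver-based app needs more cores than receivers: a receiver occupies its core continuously rather than only while a batch is running.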