be much welcomed.
Thanks!
Lucas
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 24 September 2015 00:19
To: Tracewski, Lukasz (KFDB 3)
Cc: user@spark.apache.org
Subject: Re: Join over many small files
I think this can be a good use case for the sequence file format: pack the many
files into a few sequence files, with the file name as key and the content as value.
Then read them as an RDD and produce tuples like you mentioned (key=fileno+id,
value=value). After that, generating the diff is a simple map operation.
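To make the keying step concrete, here is a minimal sketch in plain Python (standing in for the Spark map so the transformation is easy to follow; in Spark the same logic would run as a flatMap over (filename, content) pairs read from the sequence files). The file names, the "ID | VA" layout, and the parsing below are illustrative assumptions, not details from the thread.

```python
def to_tuples(filename, content):
    """Turn one small file's content into ((fileno, id), value) tuples.

    Assumes a header row followed by pipe-separated "ID | VALUE" rows,
    matching the example layout in the original mail.
    """
    rows = content.strip().splitlines()[1:]  # skip the header row
    tuples = []
    for line in rows:
        id_, value = (field.strip() for field in line.split("|"))
        tuples.append(((filename, id_), value))  # key = fileno + id
    return tuples

# Two tiny files packed as (name, content) pairs, the way a sequence
# file would hold them (hypothetical sample data).
files = [
    ("file01", "ID | VA\n1 | 10\n2 | 20\n"),
    ("file02", "ID | VA\n1 | 11\n2 | 20\n"),
]

# The flatMap equivalent: one keyed tuple per data row across all files.
keyed = [t for name, content in files for t in to_tuples(name, content)]
```

With the RDD keyed this way, rows from corresponding files land under comparable keys, so the join/diff reduces to grouping or mapping over the keyed pairs.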
Hi all,
I would like to ask you for advice on how to efficiently perform a join
operation in Spark over tens of thousands of tiny files. A single file is a
few KB with ~50 rows; in another scenario the files are around 200 KB with
2000 rows. To give you an impression of what they look like:
File 01
ID | VA