RE: Join over many small files

2015-09-24 Thread Tracewski, Lukasz
be much welcomed. Thanks! Lucas

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 24 September 2015 00:19
To: Tracewski, Lukasz (KFDB 3)
Cc: user@spark.apache.org
Subject: Re: Join over many small files

I think this can be a good case for using sequence file format to pack many files to few

Re: Join over many small files

2015-09-23 Thread ayan guha
I think this can be a good case for using the sequence file format to pack many files into a few sequence files, with the file name as key and the content as value. Then read it as an RDD and produce tuples like you mentioned (key=fileno+id, value=value). After that, it is a simple map operation to generate the diff
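
A minimal sketch of the tuple-producing step described above, written in plain Python so it runs without a cluster. It assumes each packed record is a (file name, file content) pair, which is the shape PySpark's `sc.wholeTextFiles` or `sc.sequenceFile` would hand back, and that rows are pipe-separated with an "ID" header line. The function name `to_tuples`, the sample file names, and the sample rows are all made up for illustration; they are not taken from the thread.

```python
def to_tuples(named_file):
    """Turn one (file_name, content) pair into ((file_name, row_id), value) tuples."""
    file_name, content = named_file
    rows = []
    for line in content.splitlines():
        if "|" not in line:
            continue  # skip blank or malformed lines
        row_id, value = (part.strip() for part in line.split("|", 1))
        if row_id.upper() == "ID":
            continue  # skip the assumed header row
        rows.append(((file_name, row_id), value))
    return rows


# Hypothetical stand-in for the (name, content) records read from the packed files.
packed = [
    ("file01", "ID | VALUE\n1 | a\n2 | b\n"),
    ("file02", "ID | VALUE\n1 | c\n"),
]

# Local equivalent of a flatMap over the records.
tuples = [t for record in packed for t in to_tuples(record)]
```

On an actual RDD of (name, content) pairs the same function could be passed to `flatMap`, for example `rdd.flatMap(to_tuples)`, leaving the later map step from the message to operate on the keyed tuples.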

Join over many small files

2015-09-23 Thread Tracewski, Lukasz
Hi all, I would like to ask for advice on how to efficiently perform a join operation in Spark over tens of thousands of tiny files. A single file is a few KB and has ~50 rows. In another scenario they might be 200 KB and have 2000 rows. To give you an impression of what they look like: File 01 ID | VA