Hello Nicolas,

I solved a similar problem using FSDataOutputStream

http://blog.woopi.org/wordpress/files/hadoop-2.6.0-javadoc/org/apache/hadoop/fs/FSDataOutputStream.html

Each entry can be an ArrayWritable

http://blog.woopi.org/wordpress/files/hadoop-2.6.0-javadoc/org/apache/hadoop/io/ArrayWritable.html

so that you can put an arbitrary sequence of Writables into one
entry/element.
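
A minimal sketch of that approach (the path, the key/payload layout and
the EventAppender class name are all made up for illustration; appending
also requires append support, dfs.support.append, to be enabled on the
cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class EventAppender {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events/current.bin");  // made-up path

        // Append to the existing file, or create it on first use.
        FSDataOutputStream out = fs.exists(file) ? fs.append(file)
                                                 : fs.create(file);

        // One entry: an ArrayWritable holding the fields of a single event.
        ArrayWritable entry = new ArrayWritable(Text.class, new Writable[] {
                new Text("event-key-42"),         // key of this entry
                new Text("{\"type\":\"click\"}")  // message payload
        });
        entry.write(out);  // a Writable serializes itself onto any DataOutput

        out.hsync();       // force the bytes out to the datanodes
        out.close();
    }
}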

During my tests, the best performance was achieved with the binary zlib
library. GZIP and bzip2 delivered slightly smaller files, but the
processing time was several times slower in my cases.
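
For the codec part, a rough sketch of writing through Hadoop's built-in
zlib codec (DefaultCodec is the DEFLATE/zlib codec shipped with Hadoop;
the path and the CompressedWriter class name are again made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.DefaultCodec;

public class CompressedWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // DefaultCodec uses the native zlib library when it is available
        // (you can check with `hadoop checknative`).
        DefaultCodec codec = new DefaultCodec();
        codec.setConf(conf);

        Path file = new Path("/data/events/current.deflate");  // made-up path
        CompressionOutputStream out = codec.createOutputStream(fs.create(file));
        out.write("one small event".getBytes("UTF-8"));
        out.finish();  // flush the compressed trailer
        out.close();
    }
}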

Regarding update and delete:

As far as I know, HDFS does not support updating or deleting data in
place. Tools like HBase realize this by using several HDFS files and
rewriting them from time to time. Depending on how frequently you need to
update or delete data, you can think about doing the housekeeping of your
HDFS file yourself: you delete an entry by writing a delete flag somewhere
else, referring to the key of that single data entry. Once too many
entries are deleted and you really want to clean up the HDFS file, you can
rewrite it, for example with MapReduce.
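
A sketch of such a delete flag, with the same made-up paths and a
hypothetical TombstoneWriter class; the side file simply accumulates the
keys of deleted entries:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;

public class TombstoneWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path tombstones = new Path("/data/events/_tombstones");  // made-up path

        FSDataOutputStream out = fs.exists(tombstones) ? fs.append(tombstones)
                                                       : fs.create(tombstones);
        new Text("event-key-42").write(out);  // flag this key as deleted
        out.hsync();
        out.close();
    }
}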

In the update case you can use a similar technique: just append the
new/updated version of your dataset and write a delete flag for the old
version somewhere else.
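
When you later rewrite the file, one pass can drop the tombstoned entries.
A rough single-process sketch for the delete case (at scale the same
filter would be the map step of a MapReduce job; for the update case you
would tombstone a specific version, e.g. key plus a version number, or
keep only the last occurrence of each key; paths, record layout and the
Compactor class name are made up as above):

import java.io.EOFException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

public class Compactor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path data = new Path("/data/events/current.bin");        // made-up paths
        Path tombstones = new Path("/data/events/_tombstones");
        Path compacted = new Path("/data/events/current.bin.tmp");

        // 1. Collect the keys flagged as deleted.
        Set<String> dead = new HashSet<>();
        try (FSDataInputStream in = fs.open(tombstones)) {
            Text key = new Text();
            while (true) { key.readFields(in); dead.add(key.toString()); }
        } catch (EOFException allTombstonesRead) { /* done */ }

        // 2. Copy every record whose key is still alive.
        try (FSDataInputStream in = fs.open(data);
             FSDataOutputStream out = fs.create(compacted)) {
            ArrayWritable entry = new ArrayWritable(Text.class);
            while (true) {
                try { entry.readFields(in); } catch (EOFException eof) { break; }
                String key = entry.get()[0].toString();  // key is the first field
                if (!dead.contains(key)) entry.write(out);
            }
        }

        // 3. Swap in the compacted file and start a fresh tombstone file.
        fs.delete(data, false);
        fs.rename(compacted, data);
        fs.delete(tombstones, false);
    }
}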

As a summary, you should weigh the complexity of your own hand-made
solution against an HBase solution (as mentioned before). If you don't
like key-value store databases, you can also have a look at Phoenix on top
of HBase, which delivers a very SQL-like access layer.

The only restriction I know of for HBase is that a single dataset should
not be bigger than the HDFS block size.

I hope the comments help you. If you have questions, don't hesitate to
contact me.

Good luck

Martin



2015-09-03 16:17 GMT+02:00 <nib...@free.fr>:

> My main question in case of HAR usage is: is it possible to use Pig on it,
> and what about performance?
>
> ----- Original Message -----
> From: "Jörn Franke" <jornfra...@gmail.com>
> To: nib...@free.fr, user@spark.apache.org
> Sent: Thursday, 3 September 2015 15:54:42
> Subject: Re: Small File to HDFS
>
>
>
>
> Store them as a Hadoop archive (HAR)
>
>
> On Wed, 2 Sep 2015 at 18:07, < nib...@free.fr > wrote:
>
>
> Hello,
> I'm currently using Spark Streaming to collect small messages (events),
> each under 50 KB in size, at a high volume (several million per day), and I
> have to store those messages in HDFS.
> I understood that storing small files can be problematic in HDFS; how can
> I manage it?
>
> Tks
> Nicolas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
