Thanks for correcting me on the syncFs call, Luke. I seem to have missed that method when searching the branch-1 code.
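For anyone who finds this thread in the archives, here is a rough sketch of the approach on 1.0: append records, call syncFs() every N records, and roll to a new file once it crosses a size threshold. The path, key/value types, and thresholds below are placeholders, not recommendations.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ClickLogWriter {
  // Placeholder thresholds; tune both to your durability/file-count needs.
  private static final long ROLL_BYTES = 64L * 1024 * 1024; // roll at ~64 MB
  private static final int SYNC_EVERY = 1000;               // flush every 1000 records

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = null;
    long fileIndex = 0;
    int sinceSync = 0;

    for (long i = 0; i < 1000000; i++) { // stand-in for the incoming click events
      if (writer == null) {
        Path path = new Path("/clicks/clicks-" + fileIndex + ".seq"); // placeholder path
        writer = SequenceFile.createWriter(fs, conf, path,
            LongWritable.class, Text.class);
      }

      writer.append(new LongWritable(i), new Text("click-event-" + i));

      // Periodically push buffered bytes to the datanodes so a writer
      // crash loses at most the last SYNC_EVERY records.
      if (++sinceSync >= SYNC_EVERY) {
        writer.syncFs();
        sinceSync = 0;
      }

      // Once the file passes the size threshold, close it cleanly and
      // start a new one; closed files are safe regardless of later crashes.
      if (writer.getLength() >= ROLL_BYTES) {
        writer.close();
        writer = null;
        fileIndex++;
      }
    }
    if (writer != null) {
      writer.close();
    }
  }
}

Keep in mind that syncFs() gives you hflush semantics, as Luke notes below, so the tail of the current file can still be lost in a datacenter-wide power outage; size your sync interval around how many records you can afford to replay.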
On Thu, May 31, 2012 at 6:54 AM, Luke Lu <l...@apache.org> wrote:
>
> SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since
> 0.20.205), which calls the underlying FSDataOutputStream#sync, which is
> actually hflush semantically (data is not durable in the case of a
> datacenter-wide power outage). The hsync implementation is not yet in
> 2.0; HDFS-744 just brought hsync into trunk.
>
> __Luke
>
> On Fri, May 25, 2012 at 9:30 AM, Harsh J <ha...@cloudera.com> wrote:
> > Mohit,
> >
> > Not if you call sync (or hflush/hsync in 2.0) periodically to persist
> > your changes to the file. SequenceFile doesn't currently have a sync
> > API built into it (in 1.0 at least), but you can call sync on the
> > underlying output stream instead for the moment. This is possible to
> > do in 1.0 (just own the output stream).
> >
> > Your use case also sounds like you may simply want to use Apache Flume
> > (Incubating) [http://incubator.apache.org/flume/], which already
> > provides these features and the WAL-like reliability you seek.
> >
> > On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia <mohitanch...@gmail.com>
> > wrote:
> >> We get click data through API calls. I now need to send this data to
> >> our Hadoop environment. I am wondering if I could open one sequence
> >> file and write to it until it's of a certain size. Once it's over the
> >> specified size I can close that file and open a new one. Is this a
> >> good approach?
> >>
> >> The only thing I worry about is what happens if the server crashes
> >> before I am able to cleanly close the file. Would I lose all the
> >> previous data?
> >
> > --
> > Harsh J

--
Harsh J