Hi,
I talked to some people and they agreed with me that the situation
where this problem really occurs is when someone builds a FileInputFormat
derivative that also uses a LineRecordReader derivative. This is exactly the
scenario that occurs if someone follows the Yahoo Hadoop tutorial.
Instead of c
I did some digging through the code base and inspected all the situations I
know where this goes wrong (including the Yahoo tutorial) and found a place
where the effects of this problem might be avoided (instead of
solving the cause of the problem).
It turns out that all of those use cases use t
On Fri, Jun 13, 2014 at 2:54 AM, Niels Basjes wrote:
> Hmmm, people only look at logs when they have a problem. So I don't think
> this would be enough.
This change to the framework will cause disruptions to users, to aid
InputFormat authors' debugging. The latter is a much smaller
population and
Hi,
On Wed, Jun 11, 2014 at 8:25 PM, Chris Douglas wrote:
> On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes wrote:
> > That's not what I meant. What I understood from what was described is
> that
> > sometimes people use an existing file extension (like .gz) for a file
> that
> > is not a gzipped
On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes wrote:
> That's not what I meant. What I understood from what was described is that
> sometimes people use an existing file extension (like .gz) for a file that
> is not a gzipped file.
Understood, but this change also applies to other loaded codecs,
On Tue, Jun 10, 2014 at 8:10 PM, Chris Douglas wrote:
> On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes wrote:
> > and if you then give the file the .gz extension this breaks all common
> > sense / conventions about file names.
>
> That the suffix for all compression codecs in every context- and
On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes wrote:
> and if you then give the file the .gz extension this breaks all common
> sense / conventions about file names.
That the suffix for all compression codecs in every context- and all
future codecs- should determine whether a file can be split is
On Mon, Jun 2, 2014 at 1:21 AM, Chris Douglas wrote:
> On Sat, May 31, 2014 at 10:53 PM, Niels Basjes wrote:
> > The Hadoop framework uses the filename extension to automatically insert
> > the "right" decompression codec in the read pipeline.
>
> This would be the new behavior, incompatible wi
On Fri, May 30, 2014 at 11:05 PM, Niels Basjes wrote:
> How would someone create the situation you are referring to?
By renaming files. I believe I can easily write a test using public
APIs that will succeed without this patch and will fail with this
patch. Given the number of Hadoop-based appl
On Sat, May 31, 2014 at 10:53 PM, Niels Basjes wrote:
> The Hadoop framework uses the filename extension to automatically insert
> the "right" decompression codec in the read pipeline.
This would be the new behavior, incompatible with existing code.
> So if someone does what you describe then t
The Hadoop framework uses the filename extension to automatically insert
the "right" decompression codec in the read pipeline.
So if someone does what you describe then they would need to unload all
compression codecs or face decompression errors. And if it really was
gzipped then it would not be
On Fri, May 30, 2014 at 11:05 PM, Niels Basjes wrote:
> How would someone create the situation you are referring to?
By adopting a naming convention where the filename suffix doesn't
imply that the raw data are compressed with that codec.
For example, if a user named SequenceFiles foo.lzo and fo
Ok, got it.
If someone has an Avro file (foo.avro) and gzips that (foo.avro.gz) then
the framework will select the GzipCodec, which is not capable of splitting
and which will cause the problem. So by gzipping a splittable file it
becomes non-splittable.
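
To make the foo.avro / foo.avro.gz case concrete, here is a minimal sketch of suffix-based codec selection. It is a simplified model and not Hadoop's actual FileInputFormat or codec factory code: the class name, the method signature and the suffix table are all assumptions made for illustration. The only facts it relies on are that gzip streams cannot be split and bzip2 streams can.

```java
import java.util.Map;

// Simplified model only; every name in this class is hypothetical
// and none of it is Hadoop API.
final class SplitCheckSketch {
    // Filename suffix -> whether the codec picked for that suffix can be split.
    // Assumed table for illustration: gzip is not splittable, bzip2 is.
    private static final Map<String, Boolean> CODEC_SPLITTABLE = Map.of(
        ".gz", false,
        ".bz2", true);

    // Returns true when the file may be cut into multiple input splits.
    static boolean isSplitable(String fileName) {
        for (Map.Entry<String, Boolean> entry : CODEC_SPLITTABLE.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        // No codec matched the suffix: treat the file as uncompressed.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("foo.avro"));    // prints true
        System.out.println(isSplitable("foo.avro.gz")); // prints false
    }
}
```

An input format that answers "true" for foo.avro.gz regardless of the suffix is the situation that goes wrong in this thread.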
At my workplace we have applied gzip to av
I was trying to explain my comment, where I stated that, "changing the
default implementation to return false would be an incompatible
change". The patch was added 6 months after that comment, so the
comment didn't address the patch.
The patch does not appear to change the default implementation
Remove
On Fri, May 30, 2014 at 3:03 PM, Niels Basjes wrote:
> Hi,
> The way I see the effects of the original patch on existing subclasses:
> - implemented isSplitable
> --> no performance difference.
> - did not implement isSplitable
> --> then there is no performance difference if the co
Hi,
The way I see the effects of the original patch on existing subclasses:
- implemented isSplitable
--> no performance difference.
- did not implement isSplitable
--> then there is no performance difference if the container is either
not compressed or uses a splittable compression.
-->
On Thu, May 29, 2014 at 2:47 AM, Niels Basjes wrote:
> For arguments I still do not fully understand this was rejected by Todd and
> Doug.
Performance is a part of compatibility.
Doug
I forgot to ask a relevant question: What made the original proposed
solution "incompatible"?
To me it still seems to be a clean backward compatible solution that fixes
this issue in a simple way.
Perhaps Todd can explain why?
Niels
On May 29, 2014 2:17 PM, "Niels Basjes" wrote:
> This is exact
This is exactly why I'm proposing a change that will either 'fix silently'
(my original patch from 3 years ago) or 'break loudly' (my current
proposal) old implementations.
I'm convinced that there are at least 100 companies worldwide that have a
custom implementation with this bug and have no clue
I think breaking backwards compat is sensible since it's easily caught by the
compiler, whereas the alternative is a
runtime error that can result in terabytes of mucked up output.
> On May 29, 2014, at 6:11 AM, Matt Fellows
> wrote:
>
> As someone who doesn't really contribute
As someone who doesn't really contribute, just lurks, I could well be
misinformed or under-informed, but I don't see why we can't deprecate a
method which could cause dangerous side effects?
People can still use the deprecated methods for backwards compatibility,
but are discouraged by compiler war
My original proposal (from about 3 years ago) was to change the isSplitable
method to return a safe default (you can see that in the patch that is
still attached to that Jira issue).
For arguments I still do not fully understand this was rejected by Todd and
Doug.
So that is why my new proposal i
I could be missing something, but couldn't you just deprecate isSplitable
(spelled incorrectly) and create a new isSplittable as described?
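
A rough sketch of what that deprecation could look like. This is not the real FileInputFormat: the class names and the String parameter are hypothetical, and having the new method delegate to the old one is just one possible design, chosen here so that subclasses that already override isSplitable keep their behaviour.

```java
// Hypothetical stand-in for the base class; not Hadoop code.
class InputFormatSketch {
    /**
     * @deprecated misspelled name, kept so existing subclasses still compile.
     */
    @Deprecated
    protected boolean isSplitable(String fileName) {
        return true;
    }

    // Correctly spelled replacement. Delegating to the old method is an
    // assumption of this sketch: overrides of the old name are still honoured.
    protected boolean isSplittable(String fileName) {
        return isSplitable(fileName);
    }
}

// An existing subclass that only knows the old method name.
final class LegacyFormat extends InputFormatSketch {
    @Override
    @SuppressWarnings("deprecation")
    protected boolean isSplitable(String fileName) {
        return false;
    }
}
```

Callers would move to isSplittable, and anyone still overriding the old name gets a deprecation warning from the compiler instead of a silent change in behaviour.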
On Thu, May 29, 2014 at 10:34 AM, Steve Loughran
wrote:
> On 28 May 2014 20:50, Niels Basjes wrote:
>
> > Hi,
> >
> > Last week I ran into this problem ag
On 28 May 2014 20:50, Niels Basjes wrote:
> Hi,
>
> Last week I ran into this problem again
> https://issues.apache.org/jira/browse/MAPREDUCE-2094
>
> What happens here is that the default implementation of the isSplitable
> method in FileInputFormat is so unsafe that just about everyone who
> im