Re: Change proposal for FileInputFormat isSplitable

2014-06-06 Thread Niels Basjes
On Mon, Jun 2, 2014 at 1:21 AM, Chris Douglas  wrote:

> On Sat, May 31, 2014 at 10:53 PM, Niels Basjes  wrote:
> > The Hadoop framework uses the filename extension  to automatically insert
> > the "right" decompression codec in the read pipeline.
>
> This would be the new behavior, incompatible with existing code.
>

You are right, I was wrong. It is the LineRecordReader that inserts it.

Looking at this code and where it is used, I noticed that the bug I'm trying
to prevent is present in the current trunk.
NLineInputFormat does not override isSplitable and uses the
LineRecordReader, which is capable of reading gzipped input. The overall
effect is that this input format silently produces garbage (missing lines +
duplicated lines) when run against a gzipped file. I just verified this.

> So if someone does what you describe then they would need to unload all
> compression codecs or face decompression errors. And if it really was
> gzipped then it would not be splittable at all.

> Assume an InputFormat configured for a job assumes that isSplitable
> returns true because it extends FileInputFormat. After the change, it
> could spuriously return false based on the suffix of the input files.
> In the prenominate example, SequenceFile is splittable, even if the
> codec used in each block is not. -C
>

And if you then give the file the .gz extension, that breaks all common
sense / conventions about file names.


Let's reiterate the options I see now:
1) isSplitable --> return true
Too unsafe; I say "must change". I alone have hit my head twice so far on
this, many others have too, and even the current trunk still has this bug in
it.

2) isSplitable --> return false
Safe, but too slow in some cases. In those cases the actual
implementation can simply override it and regain its original
performance.

3) isSplitable --> true (same as the current implementation) unless the
file extension is associated with a non-splittable compression codec
(e.g. .gz or something like that).
If a custom format wants to break with well-known conventions about
filenames then it should simply override isSplitable with its own
implementation.

4) isSplitable --> abstract
Compatibility breaker. I see this as the cleanest way to force the
developer of a custom FileInputFormat to think about their specific case.

I hold "correct data" much higher than performance and scalability; the
performance impact is a concern, but it is much less important than the list
of bugs we are facing right now.

The safest way would be either 2 or 4. Solution 3 would effectively be the
same as the current implementation, yet it would catch the problem
situations as long as people stick to normal file name conventions.
Solution 3 would also allow removing some code duplication in several
subclasses.

I would go for solution 3.
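To make solution 3 concrete, here is a minimal, self-contained sketch of the idea. The extension list and class name are illustrative assumptions for this example only; a real implementation would consult the configured CompressionCodecFactory and check whether the resolved codec implements SplittableCompressionCodec rather than hard-coding extensions.

```java
import java.util.Set;

/**
 * Illustrative sketch of option 3: assume a file is splittable unless its
 * extension maps to a codec known to be non-splittable. The extension set
 * is an assumption for this example, not the real Hadoop lookup.
 */
public class SplitabilityCheck {

    // Extensions of codecs that cannot be split (illustrative list).
    private static final Set<String> NON_SPLITTABLE_EXTENSIONS =
            Set.of(".gz", ".deflate", ".snappy");

    /** Mirrors the proposed default isSplitable(path) behaviour. */
    public static boolean isSplitable(String fileName) {
        String lower = fileName.toLowerCase();
        for (String ext : NON_SPLITTABLE_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return false; // safe default: never split these files
            }
        }
        return true; // same as today's default for everything else
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("part-00000.gz"));  // false
        System.out.println(isSplitable("part-00000.txt")); // true
    }
}
```

A custom format that really does want to split files named *.gz would still be free to override this default, which is exactly the escape hatch solution 3 relies on.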

Niels Basjes


[jira] [Created] (HADOOP-10668) TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails

2014-06-06 Thread Ted Yu (JIRA)
Ted Yu created HADOOP-10668:
---

 Summary: TestZKFailoverControllerStress#testExpireBackAndForth 
occasionally fails
 Key: HADOOP-10668
 URL: https://issues.apache.org/jira/browse/HADOOP-10668
 Project: Hadoop Common
  Issue Type: Test
Reporter: Ted Yu
Priority: Minor


From 
https://builds.apache.org/job/PreCommit-HADOOP-Build/4018//testReport/org.apache.hadoop.ha/TestZKFailoverControllerStress/testExpireBackAndForth/ :
{code}
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.server.DataTree.getData(DataTree.java:648)
at org.apache.zookeeper.server.ZKDatabase.getData(ZKDatabase.java:371)
at org.apache.hadoop.ha.MiniZKFCCluster.expireActiveLockHolder(MiniZKFCCluster.java:199)
at org.apache.hadoop.ha.MiniZKFCCluster.expireAndVerifyFailover(MiniZKFCCluster.java:234)
at org.apache.hadoop.ha.TestZKFailoverControllerStress.testExpireBackAndForth(TestZKFailoverControllerStress.java:84)
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (HADOOP-10647) String Format Exception in SwiftNativeFileSystemStore.java

2014-06-06 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved HADOOP-10647.
-

   Resolution: Fixed
Fix Version/s: 2.5.0

+1, committed to branch-2 and trunk. Thanks for finding this!

> String Format Exception in SwiftNativeFileSystemStore.java
> --
>
> Key: HADOOP-10647
> URL: https://issues.apache.org/jira/browse/HADOOP-10647
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs/swift
>Affects Versions: 2.4.0
>Reporter: Gene Kim
>Assignee: Gene Kim
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: HADOOP-10647.patch, hadoop.patch
>
>
> If Swift.debug is given a string containing a % character, a format exception 
> will occur. This happens when the path for any of the FileStatus objects 
> contain a % encoded character. The bug is located at 
> hadoop/src/hadoop-tools/hadoop-openstack/src/main/java/org/apache/hadoop/fs/swift/snative/SwiftNativeFileSystemStore.java:931.
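The bug follows the classic pattern of using user-controlled data as the format string itself. A minimal reproduction of that pattern (class and method names here are illustrative, not the actual Swift code):

```java
/**
 * Minimal reproduction of the HADOOP-10647 bug pattern: a %-encoded
 * path used directly as a format string. Names are illustrative only.
 */
public class FormatBugDemo {

    /** Buggy: the path itself becomes the format string. */
    public static String buggy(String path) {
        // "%20" parses as width 20 + conversion 'n', which is invalid,
        // so this throws a java.util.IllegalFormatException subclass.
        return String.format(path);
    }

    /** Fixed: the path is passed as an argument to a fixed format. */
    public static String fixed(String path) {
        return String.format("%s", path);
    }

    public static void main(String[] args) {
        String path = "/container/file%20name"; // %-encoded space
        System.out.println(fixed(path));        // safe
        try {
            buggy(path);
        } catch (java.util.IllegalFormatException e) {
            System.out.println("format exception: " + e);
        }
    }
}
```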





Build failed in Jenkins: Hadoop-Common-trunk #1131

2014-06-06 Thread Apache Jenkins Server
See 

Changes:

[umamahesh] HDFS-6464. Support multiple xattr.name parameters for WebHDFS 
getXAttrs. Contributed by Yi Liu.

[cmccabe] HDFS-6369. Document that BlockReader#available() can return more 
bytes than are remaining in the block (Ted Yu via Colin Patrick McCabe)

[junping_du] YARN-1977. Add tests on getApplicationRequest with filtering start 
time range. (Contributed by Junping Du)

--
[...truncated 67757 lines...]
[DEBUG] Initialize Maven Ant Tasks
parsing buildfile 
jar:file:/home/jenkins/.m2/repository/org/apache/maven/plugins/maven-antrun-plugin/1.7/maven-antrun-plugin-1.7.jar!/org/apache/maven/ant/tasks/antlib.xml
 with URI = 
jar:file:/home/jenkins/.m2/repository/org/apache/maven/plugins/maven-antrun-plugin/1.7/maven-antrun-plugin-1.7.jar!/org/apache/maven/ant/tasks/antlib.xml
 from a zip file
parsing buildfile 
jar:file:/home/jenkins/.m2/repository/org/apache/ant/ant/1.8.2/ant-1.8.2.jar!/org/apache/tools/ant/antlib.xml
 with URI = 
jar:file:/home/jenkins/.m2/repository/org/apache/ant/ant/1.8.2/ant-1.8.2.jar!/org/apache/tools/ant/antlib.xml
 from a zip file
Class org.apache.maven.ant.tasks.AttachArtifactTask loaded from parent loader 
(parentFirst)
 +Datatype attachartifact org.apache.maven.ant.tasks.AttachArtifactTask
Class org.apache.maven.ant.tasks.DependencyFilesetsTask loaded from parent 
loader (parentFirst)
 +Datatype dependencyfilesets org.apache.maven.ant.tasks.DependencyFilesetsTask
Setting project property: test.build.dir -> 

Setting project property: test.exclude.pattern -> _
Setting project property: hadoop.assemblies.version -> 3.0.0-SNAPSHOT
Setting project property: test.exclude -> _
Setting project property: distMgmtSnapshotsId -> apache.snapshots.https
Setting project property: project.build.sourceEncoding -> UTF-8
Setting project property: java.security.egd -> file:///dev/urandom
Setting project property: distMgmtSnapshotsUrl -> 
https://repository.apache.org/content/repositories/snapshots
Setting project property: distMgmtStagingUrl -> 
https://repository.apache.org/service/local/staging/deploy/maven2
Setting project property: avro.version -> 1.7.4
Setting project property: test.build.data -> 

Setting project property: commons-daemon.version -> 1.0.13
Setting project property: hadoop.common.build.dir -> 

Setting project property: testsThreadCount -> 4
Setting project property: maven.test.redirectTestOutputToFile -> true
Setting project property: jdiff.version -> 1.0.9
Setting project property: build.platform -> Linux-i386-32
Setting project property: project.reporting.outputEncoding -> UTF-8
Setting project property: distMgmtStagingName -> Apache Release Distribution 
Repository
Setting project property: protobuf.version -> 2.5.0
Setting project property: failIfNoTests -> false
Setting project property: protoc.path -> ${env.HADOOP_PROTOC_PATH}
Setting project property: jersey.version -> 1.9
Setting project property: distMgmtStagingId -> apache.staging.https
Setting project property: distMgmtSnapshotsName -> Apache Development Snapshot 
Repository
Setting project property: ant.file -> 

[DEBUG] Setting properties with prefix: 
Setting project property: project.groupId -> org.apache.hadoop
Setting project property: project.artifactId -> hadoop-common-project
Setting project property: project.name -> Apache Hadoop Common Project
Setting project property: project.description -> Apache Hadoop Common Project
Setting project property: project.version -> 3.0.0-SNAPSHOT
Setting project property: project.packaging -> pom
Setting project property: project.build.directory -> 

Setting project property: project.build.outputDirectory -> 

Setting project property: project.build.testOutputDirectory -> 

Setting project property: project.build.sourceDirectory -> 

Setting project property: project.build.testSourceDirectory -> 

Setting project property: localRepository ->id: local
  url: file:///home/jenkins/.m2/repository/
   layout: none
Setting project property: settings.localRepository

[jira] [Created] (HADOOP-10667) implement TCP connection reuse for native client

2014-06-06 Thread Colin Patrick McCabe (JIRA)
Colin Patrick McCabe created HADOOP-10667:
-

 Summary: implement TCP connection reuse for native client
 Key: HADOOP-10667
 URL: https://issues.apache.org/jira/browse/HADOOP-10667
 Project: Hadoop Common
  Issue Type: Sub-task
  Components: native
Affects Versions: HADOOP-10388
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe


The HDFS / YARN native clients should re-use TCP connections to avoid the 
overhead of the three-way handshake, similar to how the Java code does.
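In the Java client, connection reuse amounts to keeping a cache of established connections keyed by remote endpoint, so repeated RPCs skip the handshake. A toy sketch of that idea, using placeholder types rather than the real IPC classes:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy sketch of TCP connection reuse: connections are cached by
 * (host, port) so repeated calls avoid a new three-way handshake.
 * "Connection" is a placeholder, not the real client class.
 */
public class ConnectionPool {

    /** Placeholder for an established TCP connection. */
    public static class Connection {
        final String host;
        final int port;
        Connection(String host, int port) { this.host = host; this.port = port; }
    }

    private final Map<String, Connection> cache = new HashMap<>();

    /** Returns a cached connection if one exists, else "dials" a new one. */
    public synchronized Connection get(String host, int port) {
        return cache.computeIfAbsent(host + ":" + port,
                key -> new Connection(host, port));
    }

    public synchronized int size() { return cache.size(); }

    public static void main(String[] args) {
        ConnectionPool pool = new ConnectionPool();
        Connection a = pool.get("nn1", 8020);
        Connection b = pool.get("nn1", 8020); // same endpoint: reused
        pool.get("nn2", 8020);                // new endpoint: new connection
        System.out.println(a == b);           // true
        System.out.println(pool.size());      // 2
    }
}
```

A real implementation would also key on user/protocol and evict idle connections, but the caching structure is the core of the technique.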





Re: #Contributors on JIRA

2014-06-06 Thread Steve Loughran
I think it's unrelated ... hadoop-* hasn't moved to the role-based model
yet, AFAIK.


On 6 June 2014 00:03, Henry Saputra  wrote:

> Hi Steve,
>
> Just noticed this email. So do you know if contributors being unable to
> assign issues is a side effect or intended behavior?
>
> - Henry
>
> On Fri, May 16, 2014 at 2:48 AM, Steve Loughran 
> wrote:
> > ASF JIRA has been moving to role-based over group-based security -you may
> > be able to give more people a role than a group allows. But, as of last
> > week and a spark-initiated change, by default contributors can't assign
> > issues.
> >
> > someone could talk to infra@apache and see if a move would help
> >
> >
> > On 14 May 2014 04:31, Suresh Srinivas  wrote:
> >
> >> Last time we cleaned up names of people who had not contributed in a
> long
> >> time. That could be an option.
> >>
> >>
> >> On Mon, May 12, 2014 at 12:03 PM, Karthik Kambatla  >> >wrote:
> >>
> >> > Hi devs
> >> >
> >> > Looks like we ran over the max contributors allowed for a project,
> >> again. I
> >> > don't remember what we did last time and can't find it in my email
> >> either.
> >> >
> >> > Can we bump up the number of contributors allowed? Otherwise, we might
> >> have
> >> > to remove some of the currently inactive contributors from the list?
> >> >
> >> > Thanks
> >> > Karthik
> >> >
> >>
> >>
> >>
> >> --
> >> http://hortonworks.com/download/
> >>
> >> --
> >> CONFIDENTIALITY NOTICE
> >> NOTICE: This message is intended for the use of the individual or
> entity to
> >> which it is addressed and may contain information that is confidential,
> >> privileged and exempt from disclosure under applicable law. If the
> reader
> >> of this message is not the intended recipient, you are hereby notified
> that
> >> any printing, copying, dissemination, distribution, disclosure or
> >> forwarding of this communication is strictly prohibited. If you have
> >> received this communication in error, please contact the sender
> immediately
> >> and delete it from your system. Thank You.
> >>
> >
>
