[ https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rex Xiong updated SPARK-6384:
-----------------------------
Description:
After calling SchemaRDD.saveAsParquet, the job runs well and generates the *.parquet, _SUCCESS, _common_metadata, and _metadata files successfully. But sometimes there are leftover attempt_* folders (e.g. attempt_201503170229_0006_r_000006_736, attempt_201503170229_0006_r_000404_416) under the same output folder; each contains one Parquet file and appears to be a working temp folder. It happens even though the _SUCCESS file was created. In this situation, SparkSQL (Hive table) throws an exception when loading this Parquet folder:

Error: java.io.FileNotFoundException: Path is not a file: ............../attempt_201503170229_0006_r_000006_736
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:69)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1728)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1671)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1651)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1625)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:503)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,code=0)

I'm not sure whether it's a Spark bug or a Parquet bug.

was:
After calling SchemaRDD.saveAsParquet, the job runs well and generates the *.parquet, _SUCCESS, _common_metadata, and _metadata files successfully. But sometimes there are leftover attempt_* folders (e.g. attempt_201503170229_0006_r_000006_736, attempt_201503170229_0006_r_000404_416) under the same output folder; each contains one Parquet file and appears to be a working temp folder. It happens even though the _SUCCESS file was created. In this situation, SparkSQL throws an exception when loading this Parquet folder, with the same FileNotFoundException stack trace as above.

I'm not sure whether it's a Spark bug or a Parquet bug.

> saveAsParquet doesn't clean up attempt_* folders
> ------------------------------------------------
>
> Key: SPARK-6384
> URL: https://issues.apache.org/jira/browse/SPARK-6384
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.1
> Reporter: Rex Xiong
>
> After calling SchemaRDD.saveAsParquet, it runs well and generates *.parquet,
> _SUCCESS, _common_metadata, and _metadata files successfully.
> But sometimes there will be some attempt_* folders (e.g.
> attempt_201503170229_0006_r_000006_736,
> attempt_201503170229_0006_r_000404_416) under the same folder; each contains
> one Parquet file and seems to be a working temp folder.
> It happens even though the _SUCCESS file was created.
> In this situation, SparkSQL (Hive table) throws an exception when loading
> this Parquet folder:
> Error: java.io.FileNotFoundException: Path is not a file:
> ............../attempt_201503170229_0006_r_000006_736
> (followed by the same HDFS getBlockLocations stack trace shown in the
> description above)
> I'm not sure whether it's a Spark bug or a Parquet bug.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org