[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce
[ https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258734#comment-15258734 ]

Mitchell Gudmundson commented on HDFS-10327:
--------------------------------------------

Greetings,

Unless I'm mistaken, this is not a Spark-specific issue. Even when running simple MapReduce jobs you end up with a directory of part files named part-r-NNNNN, where the numeric suffix is the reducer number. These directories are generally meant to be interpreted as one logical "file". In the Spark world, writing out an RDD or DataFrame produces one part file per partition (just as the MR framework produces one per reducer), but the concept is no different on other distributed processing engines. It seems that one would want to be able to retrieve the contents of the various parts back as a whole.

Regards,
-Mitch
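For what it's worth, a minimal sketch of that client-side pattern using the standard Hadoop {{FileSystem}} Java API (the output path is a hypothetical placeholder); this is essentially what the shell's glob expansion does:

{code:java}
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/**
 * Sketch: treat a job output directory as one logical file by
 * concatenating its part files. The path below is a placeholder.
 */
public class CatParts {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // globStatus returns matches sorted by name, so part-r-00000,
    // part-r-00001, ... come back in order; the glob also skips _SUCCESS.
    FileStatus[] parts = fs.globStatus(new Path("/data/output/myjob/part-*"));
    OutputStream out = System.out;
    for (FileStatus part : parts) {
      // Open each part individually and append it to the combined stream.
      IOUtils.copyBytes(fs.open(part.getPath()), out, conf, false);
    }
  }
}
{code}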
[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce
[ https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258567#comment-15258567 ]

Chris Nauroth commented on HDFS-10327:
--------------------------------------

It looks like in that example, myfile.csv is a directory, and its contents are three files: _SUCCESS, part-0 and part-1. Attempting to open myfile.csv directly as a file definitely won't work. If Spark has a feature that lets you "open" it directly, then perhaps this is implemented at the application layer by Spark? Maybe it does something equivalent to {{hdfs dfs -cat myfile.csv/part*}}?

That last example demonstrates the separation of concerns I'm talking about: the Hadoop shell command performs glob expansion to identify all files matching a pattern, and then it opens and displays each file separately, using HDFS APIs that operate on individual file paths.
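To make that concrete, a minimal sketch of what the application-layer behaviour looks like from Spark's Java API: {{textFile()}} accepts a directory (or a glob), which is why reading myfile.csv works from Spark but not through a single WebHDFS OPEN. The path below is the placeholder from the example above.

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: Spark "opens" the directory by reading every part file in it.
public class ReadPartsWithSpark {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("read-parts");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Equivalent to the glob form:
      // sc.textFile("hdfs:///data/output/adp/myfile.csv/part-*")
      long lines = sc.textFile("hdfs:///data/output/adp/myfile.csv").count();
      System.out.println("lines across all part files: " + lines);
    }
  }
}
{code}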
[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce
[ https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258520#comment-15258520 ]

Thomas Hille commented on HDFS-10327:
-------------------------------------

GET http://:50070/webhdfs/v1/data/output/adp/myfile.csv?user.name=alice&op=OPEN returns:

{noformat}
{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"Path is not a file: /data/output/adp/adp_perf_7milx.csv\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)\n"}}
{noformat}

and GET http://:50070/webhdfs/v1/data/output/adp/myfile.csv?user.name=alice&op=LISTSTATUS returns:

{noformat}
{"FileStatuses":{"FileStatus":[
{"accessTime":1460595235123,"blockSize":134217728,"childrenNum":0,"fileId":169558,"group":"hdfs","length":0,"modificationTime":1460595235193,"owner":"lroot","pathSuffix":"_SUCCESS","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},
{"accessTime":1460590333797,"blockSize":134217728,"childrenNum":0,"fileId":155388,"group":"hdfs","length":3529732,"modificationTime":1460590334756,"owner":"lroot","pathSuffix":"part-0","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},
{"accessTime":1460590295008,"blockSize":134217728,"childrenNum":0,"fileId":154918,"group":"hdfs","length":3540006,"modificationTime":1460590296204,"owner":"lroot","pathSuffix":"part-1","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},
{noformat}
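Since LISTSTATUS succeeds where OPEN fails, a REST client can already stitch the parts together itself. A hypothetical sketch of that workaround (the host name is a placeholder, and the regex-based JSON scrape is a simplification; a real client would use a JSON parser):

{code:java}
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: emulate "open the directory" over plain WebHDFS by listing the
// directory and then opening each part file. Host and user are placeholders.
public class WebHdfsCatDirectory {
  static final String BASE = "http://namenode:50070/webhdfs/v1";
  static final String USER = "user.name=alice";

  public static void main(String[] args) throws Exception {
    String dir = "/data/output/adp/myfile.csv";
    // 1. LISTSTATUS works on the directory, as shown above.
    String listing = slurp(BASE + dir + "?" + USER + "&op=LISTSTATUS");
    // Pull out the part-* entries (this skips _SUCCESS); crude but illustrative.
    Matcher m = Pattern.compile("\"pathSuffix\":\"(part-[^\"]+)\"").matcher(listing);
    while (m.find()) {
      // 2. OPEN each part file; the NameNode redirects to a DataNode,
      // and HttpURLConnection follows the redirect by default.
      try (InputStream in = open(BASE + dir + "/" + m.group(1) + "?" + USER + "&op=OPEN")) {
        in.transferTo(System.out); // append this part to the combined output
      }
    }
  }

  static String slurp(String url) throws Exception {
    try (InputStream in = open(url)) {
      return new String(in.readAllBytes(), "UTF-8");
    }
  }

  static InputStream open(String url) throws Exception {
    return ((HttpURLConnection) new URL(url).openConnection()).getInputStream();
  }
}
{code}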
[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce
[ https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258517#comment-15258517 ]

Thomas Graves commented on HDFS-10327:
--------------------------------------

What command are you using to read the file when you get the error?
[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce
[ https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258488#comment-15258488 ]

Thomas Hille commented on HDFS-10327:
-------------------------------------

Hi guys,

It looks like splitting the output into part files is a MapReduce feature rather than Spark-specific (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html -- see the output of {{bin/hadoop dfs -cat /usr/joe/wordcount/output/part-0}}). So maybe it's still something for you guys?

> Open files in WEBHDFS which are stored in folders by Spark/Mapreduce
> --------------------------------------------------------------------
>
>                 Key: HDFS-10327
>                 URL: https://issues.apache.org/jira/browse/HDFS-10327
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: webhdfs
>            Reporter: Thomas Hille
>              Labels: features
>
> When Spark saves a file in HDFS, it creates a directory which includes many parts of the file. When you read it with Spark programmatically, you can read this directory as if it were a normal file.
> If you try to read this directory-style file in WebHDFS, it returns
> {"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"Path is not a file: [...]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)