[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce

2016-04-26 Thread Mitchell Gudmundson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258734#comment-15258734
 ] 

Mitchell Gudmundson commented on HDFS-10327:


Greetings,

Unless I'm mistaken, this is not a Spark-specific issue. Even when running 
simple MapReduce jobs you end up with a directory of part files (part-r-*, where 
the suffix is the reducer number). These directories are generally meant to be 
interpreted as one logical "file". In the Spark world, writing out an RDD or 
DataFrame gives you a part file per partition (just as you would get one per 
reducer in the MR framework), but the concept is no different on other 
distributed processing engines. It seems one would want to be able to retrieve 
the contents of the various parts back as a whole.

Regards,
-Mitch

> Open files in WEBHDFS which are stored in folders by Spark/Mapreduce
> 
>
> Key: HDFS-10327
> URL: https://issues.apache.org/jira/browse/HDFS-10327
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: webhdfs
>Reporter: Thomas Hille
>  Labels: features
>
> When Spark saves a file in HDFS it creates a directory which includes many 
> parts of the file. When you read it with spark programmatically, you can read 
> this directory as it is a normal file.
> If you try to read this directory-style file in webhdfs, it returns 
> {"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"Path
>  is not a file: [...]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce

2016-04-26 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258567#comment-15258567
 ] 

Chris Nauroth commented on HDFS-10327:
--

It looks like in that example, myfile.csv is a directory, and its contents are 
3 files: _SUCCESS, part-0 and part-1.  Attempting to open myfile.csv 
directly as a file definitely won't work.  If Spark has a feature that lets you 
"open" it directly, then perhaps this is implemented at the application layer 
by Spark?  Maybe it does something equivalent to {{hdfs dfs -cat 
myfile.csv/part*}}?

That last example demonstrates the separation of concerns I'm talking about: 
the Hadoop shell command performs glob expansion to identify all files matching 
a pattern, and then it opens and displays each file separately, using HDFS APIs 
that operate on individual file paths.
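
[Editorial sketch, not part of the original thread: the glob-then-open flow described above can be reproduced by any WebHDFS client. The function and host names below are hypothetical, and the sketch only builds the per-file OPEN URLs; an HTTP GET against each URL would retrieve that part's bytes.]

```python
import fnmatch
import posixpath

def expand_and_plan_opens(base_url, dir_path, listing, pattern="part-*"):
    """Client-side equivalent of `hdfs dfs -cat dir/part*` over WebHDFS:
    filter a LISTSTATUS response with a glob, then build one OPEN URL per
    matching file, since WebHDFS OPEN operates only on individual files.

    `listing` is the parsed JSON body of a LISTSTATUS call on `dir_path`.
    """
    statuses = listing["FileStatuses"]["FileStatus"]
    # Sort so parts are concatenated in their numeric-suffix order.
    names = sorted(
        s["pathSuffix"] for s in statuses
        if s["type"] == "FILE" and fnmatch.fnmatch(s["pathSuffix"], pattern)
    )
    return [
        f"{base_url}/webhdfs/v1{posixpath.join(dir_path, n)}?op=OPEN"
        for n in names
    ]
```

With the LISTSTATUS output quoted later in this thread, this would select part-0 and part-1 and skip the zero-length _SUCCESS marker, mirroring what the shell's glob expansion does before it opens each file.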



[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce

2016-04-26 Thread Thomas Hille (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258520#comment-15258520
 ] 

Thomas Hille commented on HDFS-10327:
-

GET http://:50070/webhdfs/v1/data/output/adp/myfile.csv?user.name=alice&op=OPEN

returns:

{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"Path is not a file: /data/output/adp/adp_perf_7milx.csv
\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
\n\tat java.security.AccessController.doPrivileged(Native Method)
\n\tat javax.security.auth.Subject.doAs(Subject.java:422)
\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)\n"}}


and GET http://:50070/webhdfs/v1/data/output/adp/myfile.csv?user.name=alice&op=LISTSTATUS
returns:

{"FileStatuses":{"FileStatus":[
{"accessTime":1460595235123,"blockSize":134217728,"childrenNum":0,"fileId":169558,"group":"hdfs","length":0,"modificationTime":1460595235193,"owner":"lroot","pathSuffix":"_SUCCESS","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},
{"accessTime":1460590333797,"blockSize":134217728,"childrenNum":0,"fileId":155388,"group":"hdfs","length":3529732,"modificationTime":1460590334756,"owner":"lroot","pathSuffix":"part-0","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},
{"accessTime":1460590295008,"blockSize":134217728,"childrenNum":0,"fileId":154918,"group":"hdfs","length":3540006,"modificationTime":1460590296204,"owner":"lroot","pathSuffix":"part-1","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},
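
[Editorial sketch, not part of the original thread: the two calls above already contain the pieces a client needs — LISTSTATUS to enumerate the parts, then one OPEN per part. The function name is hypothetical, and `fetch` is a stand-in for an HTTP GET against the per-file OPEN URL.]

```python
def cat_parts(listing, fetch, part_prefix="part-"):
    """Concatenate the part files of a directory-style 'file'.

    `listing` is a parsed LISTSTATUS response; zero-length markers such as
    _SUCCESS are skipped because they don't match `part_prefix`.
    `fetch(name) -> bytes` stands in for an HTTP GET of that part via OPEN.
    """
    statuses = listing["FileStatuses"]["FileStatus"]
    # Sort by name so parts come back in their numeric-suffix order.
    parts = sorted(
        s["pathSuffix"] for s in statuses
        if s["type"] == "FILE" and s["pathSuffix"].startswith(part_prefix)
    )
    return b"".join(fetch(name) for name in parts)
```

Applied to the listing above, this would fetch part-0 then part-1 and return their concatenated bytes, which is essentially what the proposed WebHDFS improvement would do server-side.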



[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce

2016-04-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258517#comment-15258517
 ] 

Thomas Graves commented on HDFS-10327:
--

What command are you using to try to read when you get the error?



[jira] [Commented] (HDFS-10327) Open files in WEBHDFS which are stored in folders by Spark/Mapreduce

2016-04-26 Thread Thomas Hille (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258488#comment-15258488
 ] 

Thomas Hille commented on HDFS-10327:
-

Hi guys,
It looks like splitting the file in parts is a mapreduce feature rather than 
spak specific (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html -- 
look at the output of $bin/hadoop dfs -cat 
/usr/joe/wordcount/output/part-0).
So its maybe still something for you guys?
