Re: ...FileNotFoundException: Path is not a file: - error on accessing HDFS with sc.wholeTextFiles

2014-12-15 Thread Karen Murphy


Thanks Akhil,

In line with your suggestion, I have used the following two commands to
flatten the directory structure:


find . -type f -iname '*' -exec mv '{}' . \;
find . -type d -exec rm -rf '{}' \;
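
One caveat with the flattening approach: files that share a name in different
subdirectories (several index.html files, say) will overwrite each other when
moved. A sketch of an alternative that leaves the tree intact, assuming the
HDFS URI and path from the original post and run from the spark-shell (where
sc is the SparkContext): enumerate only the regular files with the Hadoop
FileSystem API and pass wholeTextFiles a comma-separated list.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer

val fs = FileSystem.get(new URI("hdfs://localhost:54310"), new Configuration())

// The second argument asks for a recursive listing; it yields only regular
// files, so the zero-length directory entries never reach wholeTextFiles.
val it = fs.listFiles(new Path("/graphx/anywebsite.com/anywebsite.com/"), true)
val paths = ArrayBuffer[String]()
while (it.hasNext) paths += it.next().getPath.toString

// wholeTextFiles accepts a comma-separated list of input paths.
val pages = sc.wholeTextFiles(paths.mkString(","))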

Kind Regards
Karen



On 12/12/14 13:25, Akhil Das wrote:
I'm not quite sure whether Spark will go inside subdirectories and
pick up files from them. You could do something like the following to
bring all files into one directory.


find . -iname '*' -exec mv '{}' . \;


Thanks
Best Regards

On Fri, Dec 12, 2014 at 6:34 PM, Karen Murphy k.l.mur...@qub.ac.uk wrote:



When I try to load a text file from an HDFS path using

sc.wholeTextFiles(hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/)

I get the following error:
java.io.FileNotFoundException: Path is not a file:
/graphx/anywebsite.com/anywebsite.com/css
(full stack trace at bottom of message).

If I switch my Scala code to reading the input file from the local
disk, wholeTextFiles doesn't pick up directories (such as css in
this case) and there is no exception raised.

The trace information in the 'local file' version shows that only
plain text files are collected with sc.wholeTextFiles:

14/12/12 11:51:29 INFO WholeTextFileRDD: Input split:

Paths:/tmp/anywebsite.com/anywebsite.com/index-2.html:0+6192,/tmp/anywebsite.com/anywebsite.com/gallery.html:0+3258,/tmp/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/tmp/anywebsite.com/anywebsite.com/jquery.html:0+326,/tmp/anywebsite.com/anywebsite.com/index.html:0+6174,/tmp/anywebsite.com/anywebsite.com/contact.html:0+3050,/tmp/anywebsite.com/anywebsite.com/archive.html:0+3247


Yet the trace information in the 'HDFS file' version shows
directories too are collected with sc.wholeTextFiles:

14/12/12 11:49:07 INFO WholeTextFileRDD: Input split:

Paths:/graphx/anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0

14/12/12 11:49:07 ERROR Executor: Exception in task 1.0 in stage
0.0 (TID 1)
java.io.FileNotFoundException: Path is not a file:
/graphx/anywebsite.com/anywebsite.com/css
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)

Should the HDFS version behave the same as the local version of
wholeTextFiles as far as the treatment of directories/non-plain-text
files is concerned?

Any help, advice or workaround suggestions would be much appreciated,

Thanks
Karen

VERSION INFO
Ubuntu 14.04
Spark 1.1.1
Hadoop 2.5.2
Scala 2.10.4

FULL STACK TRACE
14/12/12 12:02:31 INFO WholeTextFileRDD: Input split:


...FileNotFoundException: Path is not a file: - error on accessing HDFS with sc.wholeTextFiles

2014-12-12 Thread Karen Murphy

When I try to load a text file from an HDFS path using 
sc.wholeTextFiles(hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/)

I get the following error:

java.io.FileNotFoundException: Path is not a file: 
/graphx/anywebsite.com/anywebsite.com/css
(full stack trace at bottom of message).

If I switch my Scala code to reading the input file from the local disk,
wholeTextFiles doesn't pick up directories (such as css in this case) and there
is no exception raised.

The trace information in the 'local file' version shows that only plain text 
files are collected with sc.wholeTextFiles:

14/12/12 11:51:29 INFO WholeTextFileRDD: Input split: 
Paths:/tmp/anywebsite.com/anywebsite.com/index-2.html:0+6192,/tmp/anywebsite.com/anywebsite.com/gallery.html:0+3258,/tmp/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/tmp/anywebsite.com/anywebsite.com/jquery.html:0+326,/tmp/anywebsite.com/anywebsite.com/index.html:0+6174,/tmp/anywebsite.com/anywebsite.com/contact.html:0+3050,/tmp/anywebsite.com/anywebsite.com/archive.html:0+3247

Yet the trace information in the 'HDFS file' version shows directories too are 
collected with sc.wholeTextFiles:

14/12/12 11:49:07 INFO WholeTextFileRDD: Input split: 
Paths:/graphx/anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
14/12/12 11:49:07 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.io.FileNotFoundException: Path is not a file: 
/graphx/anywebsite.com/anywebsite.com/css
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
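
The css, highslide, images, and js entries above all show a length of 0+0,
which is how directories appear in the split. To confirm which entries are
directories rather than files, the input path can be listed directly; a small
diagnostic sketch from the spark-shell, reusing the URI and path above:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("hdfs://localhost:54310"), new Configuration())

// Directories print with file=false; they match the 0+0 entries in the split.
fs.listStatus(new Path("/graphx/anywebsite.com/anywebsite.com/"))
  .foreach(s => println(s"${s.getPath}  file=${s.isFile}"))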

Should the HDFS version behave the same as the local version of wholeTextFiles
as far as the treatment of directories/non-plain-text files is concerned?

Any help, advice or workaround suggestions would be much appreciated,

Thanks
Karen

VERSION INFO
Ubuntu 14.04
Spark 1.1.1
Hadoop 2.5.2
Scala 2.10.4

FULL STACK TRACE
14/12/12 12:02:31 INFO WholeTextFileRDD: Input split: 
Paths:/graphx/anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
14/12/12 12:02:31 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.io.FileNotFoundException: Path is not a file: 
/graphx/anywebsite.com/anywebsite.com/css
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 

Re: ...FileNotFoundException: Path is not a file: - error on accessing HDFS with sc.wholeTextFiles

2014-12-12 Thread Akhil Das
I'm not quite sure whether Spark will go inside subdirectories and pick up
files from them. You could do something like the following to bring all files
into one directory.

find . -iname '*' -exec mv '{}' . \;
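
If every wanted input shares an extension, a glob may be enough: Spark hands
the path string to Hadoop, which expands glob patterns when listing input
files, so a pattern matching only the .html files would skip the directory
entries without moving anything. A sketch, assuming the path from the
original post:

val pages = sc.wholeTextFiles(
  "hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/*.html")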


Thanks
Best Regards

On Fri, Dec 12, 2014 at 6:34 PM, Karen Murphy k.l.mur...@qub.ac.uk wrote:


 When I try to load a text file from an HDFS path using
 sc.wholeTextFiles(hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/)

 I get the following error:

 java.io.FileNotFoundException: Path is not a file: /graphx/anywebsite.com/anywebsite.com/css
 (full stack trace at bottom of message).

 If I switch my Scala code to reading the input file from the local disk,
 wholeTextFiles doesn't pick up directories (such as css in this case) and
 there is no exception raised.

  The trace information in the 'local file' version shows that only plain
 text files are collected with sc.wholeTextFiles:

  14/12/12 11:51:29 INFO WholeTextFileRDD: Input split: Paths:/tmp/
 anywebsite.com/anywebsite.com/index-2.html:0+6192,/tmp/anywebsite.com/anywebsite.com/gallery.html:0+3258,/tmp/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/tmp/anywebsite.com/anywebsite.com/jquery.html:0+326,/tmp/anywebsite.com/anywebsite.com/index.html:0+6174,/tmp/anywebsite.com/anywebsite.com/contact.html:0+3050,/tmp/anywebsite.com/anywebsite.com/archive.html:0+3247

  Yet the trace information in the 'HDFS file' version shows directories
 too are collected with sc.wholeTextFiles:

  14/12/12 11:49:07 INFO WholeTextFileRDD: Input split: Paths:/graphx/
 anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
 14/12/12 11:49:07 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID
 1)
 java.io.FileNotFoundException: Path is not a file: /graphx/anywebsite.com/anywebsite.com/css
 at
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
 at
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)

 Should the HDFS version behave the same as the local version of
 wholeTextFiles as far as the treatment of directories/non-plain-text files
 is concerned?

  Any help, advice or workaround suggestions would be much appreciated,

  Thanks
 Karen

  VERSION INFO
 Ubuntu 14.04
 Spark 1.1.1
 Hadoop 2.5.2
 Scala 2.10.4

  FULL STACK TRACE
 14/12/12 12:02:31 INFO WholeTextFileRDD: Input split: Paths:/graphx/
 anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
 14/12/12 12:02:31 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID
 1)
 java.io.FileNotFoundException: Path is not a file: /graphx/anywebsite.com/anywebsite.com/css
 at
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
 at
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
 at
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 at