Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...
This is likely because HDFS's core-site.xml (or something similar) provides an "fs.default.name", which changes the default FileSystem, and Spark uses the Hadoop FileSystem API to resolve paths. Anyway, your solution is definitely a good one -- another would be to remove hdfs from Spark's classpath if you didn't want it, or to specify an overriding fs.default.name (a sketch of the latter follows the quoted summary below).

On Thu, Apr 10, 2014 at 2:30 PM, didata.us wrote:
> In summary, '/path/to/some/file' is interpreted as an in-HDFS relative
> path when an HDFS configuration is found, and as an absolute local UNIX
> file path when an HDFS configuration is *not* found.
> [ ... full message and the rest of the thread appear below ... ]
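For concreteness, here is a minimal sketch of the "overriding fs.default.name" approach from within the pyspark shell. It goes through pyspark's internal sc._jsc handle (a py4j wrapper around the JavaSparkContext, not a stable public API), so treat it as illustrative rather than definitive. 'fs.defaultFS' is the newer name for the deprecated 'fs.default.name' key; setting both should cover older Hadoop versions.

    # Sketch: override the default filesystem that core-site.xml provided,
    # so that unqualified paths resolve locally.
    # Note: sc._jsc is pyspark-internal, not a public API.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.defaultFS", "file:///")      # current key name
    hadoop_conf.set("fs.default.name", "file:///")   # deprecated alias

    # Unqualified paths should now resolve against the local filesystem:
    distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
    print(distData.count())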
Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...
Hi:

I believe I figured out the behavior here:

A file specified to SparkContext like this '/path/to/some/file':

  * Will be interpreted as 'hdfs://path/to/some/file' when settings for
    HDFS are present in '/etc/hadoop/conf/*-site.xml'.
  * Will be interpreted as 'file:///path/to/some/file' (i.e. locally)
    otherwise.

I confirmed this behavior by temporarily doing this:

  * user$ sudo mv /etc/hadoop/conf /etc/hadoop/conf_

after which I re-ran my commands below. This time the SparkContext did,
indeed, look for, and find, the file locally.

In summary, '/path/to/some/file' is interpreted as an in-HDFS relative
path when an HDFS configuration is found, and as an absolute local UNIX
file path when an HDFS configuration is *not* found.

To be on the safe side, it's probably best to qualify local files with
'file:///' when that is what's intended, and with 'hdfs://' when HDFS is
what's intended.

Hope this helps someone. :)

---
Sincerely,
Noel M. Vega
DiData
www.didata.us

On 2014-04-10 14:53, DiData wrote:
> [ ... quoted thread snipped; the message appears below ... ]
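To make that rule concrete, a short illustrative snippet (the local path is the one from this thread; the 'namenode:8020' authority is only a placeholder for a real NameNode address):

    # Fully qualified URIs are unambiguous, whatever core-site.xml says:
    local_data = sc.textFile('file:///home/user/Download/ml-10M100K/ratings.dat')
    hdfs_data = sc.textFile('hdfs://namenode:8020/path/to/some/file')

    # An unqualified path inherits the effective default filesystem --
    # HDFS when /etc/hadoop/conf is picked up, the local FS otherwise:
    ambiguous = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')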
Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...
Hi Alton:

Thanks for the reply. I just wanted to build/use it from scratch to get a
better intuition of what's happening.

Btw, using the binaries provided by Cloudera/CDH5 yielded the same issue as
my compiled version (i.e. it, too, tried to access the HDFS NameNode. Same
exact error).

However, a small breakthrough. Just now I tinkered some more and found that
this variation works:

REPLACE THIS:
>>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')

WITH THIS:
>>> distData = sc.textFile('file:///home/user/Download/ml-10M100K/ratings.dat')

That is, use 'file:///'. I don't know if that is the correct way of
specifying the URI for local files, or whether this just *happens to work*.
The documents that I've read thus far haven't shown it specified that way,
but I still have more to read.

=:)

Thank you,

~NMV

On 04/10/2014 04:20 PM, Alton Alexander wrote:
> [ ... quoted thread snipped; the message and the original question appear
> below ... ]
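A quick way to check which interpretation you are getting -- hedged, since it leans on pyspark's internal py4j handles (sc._jvm and sc._jsc) rather than a public API -- is to ask Hadoop directly for the effective default filesystem:

    # Diagnostic sketch: print the URI of the default Hadoop filesystem.
    conf = sc._jsc.hadoopConfiguration()
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)
    print(fs.getUri())  # expect 'file:///' locally, or an 'hdfs://...' URI
                        # when /etc/hadoop/conf's core-site.xml is picked up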
Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...
I am doing the exact same thing for the purpose of learning. I also don't
have a Hadoop cluster and plan to scale on EC2 as soon as I get it working
locally.

I am having good success just using the binaries and not compiling from
source... Is there a reason why you aren't just using the binaries?

On Thu, Apr 10, 2014 at 1:30 PM, DiData wrote:
> Hello friends:
>
> I recently compiled and installed Spark v0.9 from the Apache distribution.
>
> Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually,
> the entire big-data suite from CDH is installed), but for the moment I'm
> using my manually built Apache Spark for 'ground-up' learning purposes.
>
> Now, prior to compilation (i.e. 'sbt/sbt clean compile') I specified the
> following:
>
> export SPARK_YARN=true
> export SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0
>
> The resulting examples ran fine locally as well as on YARN.
>
> I'm not interested in YARN here; I just mention it for completeness in
> case it matters for my upcoming question. Here is my issue / question:
>
> I start pyspark locally -- on one machine, for API learning purposes --
> as shown below, and attempt to interact with a local text file (not in
> HDFS). Unfortunately, the SparkContext (sc) tries to connect to an HDFS
> NameNode (which I don't currently have enabled because I don't need it).
>
> The SparkContext cleverly inspects the configurations in my
> '/etc/hadoop/conf/' directory to learn where my NameNode is; however, I
> don't want it to do that in this case. I just want it to run a
> one-machine local version of 'pyspark'.
>
> Did I miss something in my invocation/use of 'pyspark' below? Do I need
> to add something else?
>
> (Btw: I searched but could not find any solutions, and the documentation,
> while good, doesn't quite get me there.)
>
> See below, and thank you all in advance!
>
> user$ export PYSPARK_PYTHON=/usr/bin/bpython
> user$ export MASTER=local[8]
> user$ /home/user/APPS.d/SPARK.d/latest/bin/pyspark
> # ===
> >>> sc
> >>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
> >>> distData.count()
> [ ... snip ... ]
> Py4JJavaError: An error occurred while calling o21.collect.
> : java.net.ConnectException: Call From server01/192.168.0.15 to
>   namenode:8020 failed on connection exception:
>   java.net.ConnectException: Connection refused; For more details see:
>   http://wiki.apache.org/hadoop/ConnectionRefused
> [ ... snip ... ]
> # ===
>
> --
> Sincerely,
> DiData