Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread Aaron Davidson
This is likely because HDFS's core-site.xml (or something similar) provides
an "fs.default.name" which changes the default FileSystem, and Spark uses
the Hadoop FileSystem API to resolve paths. Anyway, your solution is
definitely a good one -- another would be to remove hdfs from Spark's
classpath if you didn't want it, or to specify an overriding fs.default.name.
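
For what it's worth, here is a minimal sketch of that second option from the
pyspark shell. This assumes PySpark's internal JavaSparkContext handle
(sc._jsc) exposes hadoopConfiguration() -- an assumption about PySpark
internals rather than a documented API -- and that Spark reads the Hadoop
configuration at the time textFile() is called:

  >>> # Assumption: point the default FileSystem at the local filesystem for
  >>> # this session only, so bare paths no longer resolve against HDFS.
  >>> sc._jsc.hadoopConfiguration().set("fs.default.name", "file:///")
  >>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
  >>> distData.count()

The explicit file:/// URI you used is still the safer, more portable fix.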


On Thu, Apr 10, 2014 at 2:30 PM, didata.us  wrote:

>  Hi:
>
> I believe I figured out the behavior here:
>
> A file specified to SparkContext like this '/path/to/some/file':
>
>    - Will be interpreted as 'hdfs://path/to/some/file' when settings
>    for HDFS are present in '/etc/hadoop/conf/*-site.xml'.
>    - Will be interpreted as 'file:///path/to/some/file' (i.e. locally)
>    otherwise.
>
> I confirmed this behavior by temporarily doing this:
>
>- user$ sudo mv /etc/hadoop/conf /etc/hadoop/conf_
>
> after which I re-ran my commands below. This time the SparkContext did,
> indeed, look for, and find, the file locally.
>
> In summary, '/path/to/some/file' is interpreted as a path within HDFS
> when an HDFS configuration is found, and as an absolute local UNIX file
> path when an HDFS configuration is *not* found.
>
> To be on the safe side, it's probably best to qualify local files with
> 'file:///' when that is what's intended, and with 'hdfs://' when HDFS is
> what's intended.
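>
> For instance (a quick illustration; the paths and host below are made up):
>
>    >>> localData = sc.textFile('file:///home/user/data/ratings.dat')          # always the local filesystem
>    >>> hdfsData  = sc.textFile('hdfs://namenode:8020/user/data/ratings.dat')  # always HDFS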
>
> Hope this helps someone. :)
>
> ---
> Sincerely,
> Noel M. Vega
> DiData
> www.didata.us
>
>  On 2014-04-10 14:53, DiData wrote:
>
> Hi Alton:
>
> Thanks for the reply. I just wanted to build/use it from scratch to get a
> better intuition of what's happening.
>
> Btw, using the binaries provided by Cloudera/CDH5 yielded the same issue
> as my compiled version (i.e. it, too, tried to access the HDFS NameNode;
> same exact error).
>
> However, a small breakthrough. Just now I tinkered some more and found
> that this variation works:
>
>    REPLACE THIS: >>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
>    WITH THIS:    >>> distData = sc.textFile('file:///home/user/Download/ml-10M100K/ratings.dat')
>
> That is, use 'file:///'. I don't know if that is the correct way of
> specifying the URI for local files, or whether this just *happens to work*.
> The documents that I've read thus far haven't shown it specified that way,
> but I still have more to read.
>
>   =:)
>
>
> Thank you,
>
> ~NMV
>
>
> On 04/10/2014 04:20 PM, Alton Alexander wrote:
>
> I am doing the exact same thing for the purpose of learning. I also
> don't have a hadoop cluster and plan to scale on ec2 as soon as I get
> it working locally.
>
> I am having good success just using the binaries on and not compiling
> from source... Is there a reason why you aren't just using the
> binaries?
>
> On Thu, Apr 10, 2014 at 1:30 PM, DiData wrote:
>
>  Hello friends:
>
> I recently compiled and installed Spark v0.9 from the Apache distribution.
>
> Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually,
> the
> entire big-data suite from CDH is installed), but for the moment I'm using
> my
> manually built Apache Spark for 'ground-up' learning purposes.
>
> Now, prior to compilation (i.e. 'sbt/sbt clean compile') I specified the
> following:
>
>   export SPARK_YARN=true
>   export SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0
>
> The resulting examples ran fine locally as well as on YARN.
>
> I'm not interested in YARN here; just mentioning it for completeness in case
> that matters in
> my upcoming question. Here is my issue / question:
>
> I start pyspark locally -- on one machine for API learning purposes -- as
> shown below, and attempt to
> interact with a local text file (not in HDFS). Unfortunately, the
> SparkContext (sc) tries to connect to
> an HDFS NameNode (which I don't currently have enabled because I don't need
> it).
>
> The SparkContext cleverly inspects the configurations in my
> '/etc/hadoop/conf/' directory to learn
> where my NameNode is; however, I don't want it to do that in this case. I
> just want to run a
> one-machine, local version of 'pyspark'.
>
> Did I miss something in my invocation/use of 'pyspark' below? Do I need to
> add something else?
>
> (Btw: I searched but could not find any solutions, and the documentation,
> while good, doesn't
> quite get me there).
>
> See below, and thank you all in advance!
>
>
> user$ export PYSPARK_PYTHON=/usr/bin/bpython
> user$ export MASTER=local[8]
> user$ /home/user/APPS.d/SPARK.d/latest/bin/pyspark
>   #
> ===
>   >>> sc
>   
>   >>>
>   >>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
>   >>> distData.count()
>   [ ... snip ... ]
>   Py4JJavaError: An error occurred while calling o21.collect.
>   : java.net.ConnectException: Call From server01/192.168.0.15 to
> namenode:8020 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
>   [ ... snip ... ]
>   >>>
>   >>>
>   #
> ===
>
> --
> Sincerely,
> DiData
