NPE in Parquet

2015-01-20 Thread Alessandro Baretta
All,

I strongly suspect this might be caused by a glitch in the communication
with Google Cloud Storage, which my job is writing to, as this NPE shows up
fairly randomly. Any ideas?

Exception in thread "Thread-126" java.lang.NullPointerException
at
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at
scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at
scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:447)
at
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:485)
at
org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:190)
at
Truven$Stats$anonfun$save_to_parquet$3$anonfun$21$anon$7.run(Truven.scala:957)
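
If the failure really is a transient glitch on the storage side, one
workaround would be to wrap the call in a small retry loop, roughly as in
the sketch below (the retry count, delay, and path are placeholders, and
this of course only papers over the underlying problem):

import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Retry a block a few times before giving up; a workaround sketch only.
@tailrec
def retry[A](attempts: Int, delayMs: Long)(body: => A): A =
  Try(body) match {
    case Success(result) => result
    case Failure(_) if attempts > 1 =>
      Thread.sleep(delayMs)
      retry(attempts - 1, delayMs)(body)
    case Failure(e) => throw e
  }

// Assumes a SQLContext named sqlContext; the path is a placeholder.
val data = retry(attempts = 3, delayMs = 5000) {
  sqlContext.parquetFile("gs://my-bucket/output/parquet")
}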


Alex


Re: Job priority

2015-01-11 Thread Alessandro Baretta
Cody,

I might be able to improve the scheduling of my jobs by using a few
different pools with weights equal to, say, 1, 1e3, and 1e6, effectively
getting a small handful of priority classes. Still, this is not quite what
I am describing, which is why my original post was on the dev list. Let me
then ask whether there is any interest in having priority-queue job
scheduling in Spark. This is something I might be able to pull off.
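
For concreteness, the pool-based approximation I have in mind would look
roughly like the sketch below. This is only a sketch: the pool names,
weights, and file path are placeholders, the fair-scheduler XML is written
inline purely to keep it self-contained, and I have not verified how closely
very lopsided weights approximate a strict priority queue.

import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkConf, SparkContext}

// Three pools with wildly different weights, approximating priority classes.
val allocations =
  """<?xml version="1.0"?>
    |<allocations>
    |  <pool name="low"><schedulingMode>FIFO</schedulingMode><weight>1</weight></pool>
    |  <pool name="medium"><schedulingMode>FIFO</schedulingMode><weight>1000</weight></pool>
    |  <pool name="high"><schedulingMode>FIFO</schedulingMode><weight>1000000</weight></pool>
    |</allocations>""".stripMargin
Files.write(Paths.get("/tmp/fairscheduler.xml"), allocations.getBytes("UTF-8"))

val conf = new SparkConf()
  .setAppName("priority-pools-sketch")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/tmp/fairscheduler.xml")
val sc = new SparkContext(conf)

// Each job is submitted from a thread that first names the pool matching its
// priority; jobs in the heavier pool should then dominate the scheduling.
sc.setLocalProperty("spark.scheduler.pool", "high")
sc.parallelize(1 to 1000).count()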

Alex

On Sun, Jan 11, 2015 at 6:21 AM, Cody Koeninger c...@koeninger.org wrote:

 If you set up a number of pools equal to the number of different priority
 levels you want, make the relative weights of those pools very different,
 and submit a job to the pool representing its priority, I think you'll get
 behavior equivalent to a priority queue. Try it and see.

 If I'm misunderstanding what you're trying to do, then I don't know.


 On Sunday, January 11, 2015, Alessandro Baretta alexbare...@gmail.com
 wrote:

 Cody,

 Maybe I'm not getting this, but it doesn't look like this page is
 describing a priority queue scheduling policy. What this section discusses
 is how resources are shared between queues. A weight-1000 pool will get
 1000 times more resources allocated to it than a weight-1 pool. Great,
 but not what I want. I want to be able to define an Ordering on my tasks
 representing their priority, and have Spark allocate all resources to the
 job that has the highest priority.

 Alex

 On Sat, Jan 10, 2015 at 10:11 PM, Cody Koeninger c...@koeninger.org
 wrote:


 http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties

 Setting a high weight such as 1000 also makes it possible to implement
 *priority* between pools—in essence, the weight-1000 pool will always
 get to launch tasks first whenever it has jobs active.

 On Sat, Jan 10, 2015 at 11:57 PM, Alessandro Baretta 
 alexbare...@gmail.com wrote:

 Mark,

 Thanks, but I don't see how this documentation solves my problem. You
 are referring me to documentation of fair scheduling; whereas, I am asking
 about as unfair a scheduling policy as can be: a priority queue.

 Alex

 On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 -dev, +user

 http://spark.apache.org/docs/latest/job-scheduling.html


 On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta 
 alexbare...@gmail.com wrote:

 Is it possible to specify a priority level for a job, such that the
 active
 jobs might be scheduled in order of priority?

 Alex








Re: Job priority

2015-01-10 Thread Alessandro Baretta
Cody,

Maybe I'm not getting this, but it doesn't look like this page is
describing a priority queue scheduling policy. What this section discusses
is how resources are shared between queues. A weight-1000 pool will get
1000 times more resources allocated to it than a weight-1 pool. Great,
but not what I want. I want to be able to define an Ordering on my tasks
representing their priority, and have Spark allocate all resources to the
job that has the highest priority.

Alex

On Sat, Jan 10, 2015 at 10:11 PM, Cody Koeninger c...@koeninger.org wrote:


 http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties

 Setting a high weight such as 1000 also makes it possible to implement
 *priority* between pools—in essence, the weight-1000 pool will always get
 to launch tasks first whenever it has jobs active.

 On Sat, Jan 10, 2015 at 11:57 PM, Alessandro Baretta 
 alexbare...@gmail.com wrote:

 Mark,

 Thanks, but I don't see how this documentation solves my problem. You are
 referring me to documentation of fair scheduling; whereas, I am asking
 about as unfair a scheduling policy as can be: a priority queue.

 Alex

 On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 -dev, +user

 http://spark.apache.org/docs/latest/job-scheduling.html


 On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta 
 alexbare...@gmail.com wrote:

 Is it possible to specify a priority level for a job, such that the
 active
 jobs might be scheduled in order of priority?

 Alex







Re: Job priority

2015-01-10 Thread Alessandro Baretta
Mark,

Thanks, but I don't see how this documentation solves my problem. You are
referring me to documentation of fair scheduling; whereas, I am asking
about as unfair a scheduling policy as can be: a priority queue.

Alex

On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra m...@clearstorydata.com
wrote:

 -dev, +user

 http://spark.apache.org/docs/latest/job-scheduling.html


 On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta alexbare...@gmail.com
  wrote:

 Is it possible to specify a priority level for a job, such that the active
 jobs might be scheduled in order of priority?

 Alex





/tmp directory fills up

2015-01-09 Thread Alessandro Baretta
Gents,

I'm building Spark from the current master branch and deploying it to
Google Compute Engine on top of Hadoop 2.4/YARN via bdutil, Google's Hadoop
cluster provisioning tool. bdutil configures Spark with

spark.local.dir=/hadoop/spark/tmp,

but this option is ignored in combination with YARN. bdutil also configures
YARN with:

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/pd1/hadoop/yarn/nm-local-dir</value>
    <description>
      Directories on the local machine in which to store application temp files.
    </description>
  </property>

This is the right directory for Spark to store temporary data in. Still,
Spark is creating directories such as this one:

/tmp/spark-51388ee6-9de6-411d-b9b9-ab6f9502d01e

and filling them with gigabytes' worth of output files, which in turn fills
up the very small root filesystem.

How can I diagnose why my Spark installation is not picking up
yarn.nodemanager.local-dirs from YARN?
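
In case it helps narrow this down, here is a minimal diagnostic sketch I can
run from the shell to see which scratch-directory settings the executors
actually observe. It is only a sketch: which of these settings Spark really
consults depends on the deployment mode, and the parallelism is arbitrary.

// Driver-side view of the setting.
println(sc.getConf.getOption("spark.local.dir"))

// Executor-side view: report the candidate scratch-directory settings per host.
sc.parallelize(1 to 100, 100).map { _ =>
  (java.net.InetAddress.getLocalHost.getHostName,
   sys.env.get("LOCAL_DIRS"),        // set by the YARN NodeManager inside containers
   sys.env.get("SPARK_LOCAL_DIRS"),
   sys.props.get("spark.local.dir"),
   sys.props.get("java.io.tmpdir"))
}.distinct.collect().foreach(println)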

Alex


Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
All,

I'm using the Spark shell to interact with a small test deployment of
Spark, built from the current master branch. I'm processing a dataset
comprising a few thousand objects on Google Cloud Storage, split into a
half dozen directories. My code constructs an object--let me call it the
Dataset object--that defines a distinct RDD for each directory. The
constructor of the object only defines the RDDs; it does not actually
evaluate them, so I would expect it to return very quickly. Indeed, the
logging code in the constructor prints a line signaling the completion of
the code almost immediately after invocation, but the Spark shell does not
show the prompt right away. Instead, it spends a few minutes seemingly
frozen, eventually producing the following output:

14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process
: 9

14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process
: 759

14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process
: 228

14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process
: 3076

14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process
: 1013

14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process
: 156

This stage is inexplicably slow. What could be happening?
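
For reference, the construction is roughly of the following shape. This is a
heavily simplified sketch: the class name, fields, and paths are placeholders
for the real code, which is considerably larger.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Simplified sketch of the Dataset object described above: one RDD per directory.
class Dataset(sc: SparkContext, root: String, dirs: Seq[String]) {
  // Only defines the RDDs; no action is triggered here, so I would expect
  // construction to return almost immediately.
  val rdds: Map[String, RDD[String]] =
    dirs.map(dir => dir -> sc.textFile(s"$root/$dir/*")).toMap
  println(s"Dataset: defined ${rdds.size} RDDs")
}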

Thanks.


Alex


Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Denny,

No, gsutil scans through the listing of the bucket quickly. See the
following.

alex@hadoop-m:~/split$ time bash -c "gsutil ls
gs://my-bucket/20141205/csv/*/*/* | wc -l"

6860

real    0m6.971s
user    0m1.052s
sys     0m0.096s

Alex

On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com wrote:

 I'm curious if you're seeing the same thing when using bdutil against
 GCS?  I'm wondering if this may be an issue concerning the transfer rate of
 Spark -> Hadoop -> GCS Connector -> GCS.


 On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 All,

 I'm using the Spark shell to interact with a small test deployment of
 Spark, built from the current master branch. I'm processing a dataset
 comprising a few thousand objects on Google Cloud Storage, split into a
 half dozen directories. My code constructs an object--let me call it the
 Dataset object--that defines a distinct RDD for each directory. The
 constructor of the object only defines the RDDs; it does not actually
 evaluate them, so I would expect it to return very quickly. Indeed, the
 logging code in the constructor prints a line signaling the completion of
 the code almost immediately after invocation, but the Spark shell does not
 show the prompt right away. Instead, it spends a few minutes seemingly
 frozen, eventually producing the following output:

 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
 process : 9

 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
 process : 759

 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
 process : 228

 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
 process : 3076

 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
 process : 1013

 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
 process : 156

 This stage is inexplicably slow. What could be happening?

 Thanks.


 Alex




Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Well, what do you suggest I run to test this? But more importantly, what
information would this give me?

On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote:

 Oh, it makes sense that gsutil scans through this quickly, but I was
 wondering if running a Hadoop job / bdutil would result in just as fast
 scans?


 On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 Denny,

 No, gsutil scans through the listing of the bucket quickly. See the
 following.

 alex@hadoop-m:~/split$ time bash -c "gsutil ls
 gs://my-bucket/20141205/csv/*/*/* | wc -l"

 6860

 real    0m6.971s
 user    0m1.052s
 sys     0m0.096s

 Alex


 On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com
 wrote:

 I'm curious if you're seeing the same thing when using bdutil against
 GCS?  I'm wondering if this may be an issue concerning the transfer rate of
 Spark -> Hadoop -> GCS Connector -> GCS.


 On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 All,

 I'm using the Spark shell to interact with a small test deployment of
 Spark, built from the current master branch. I'm processing a dataset
 comprising a few thousand objects on Google Cloud Storage, split into a
 half dozen directories. My code constructs an object--let me call it the
 Dataset object--that defines a distinct RDD for each directory. The
 constructor of the object only defines the RDDs; it does not actually
 evaluate them, so I would expect it to return very quickly. Indeed, the
 logging code in the constructor prints a line signaling the completion of
 the code almost immediately after invocation, but the Spark shell does not
 show the prompt right away. Instead, it spends a few minutes seemingly
 frozen, eventually producing the following output:

 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
 process : 9

 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
 process : 759

 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
 process : 228

 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
 process : 3076

 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
 process : 1013

 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
 process : 156

 This stage is inexplicably slow. What could be happening?

 Thanks.


 Alex




Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Here's another data point: the slow part of my code is the construction of
an RDD as the union of the textFile RDDs representing data from several
distinct Google Cloud Storage directories. So the question becomes the following:
what computation happens when calling the union method on two RDDs?
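
To pin down where the time goes, I am thinking of something along the lines
of the timing sketch below (the paths are placeholders; the point is only to
separate defining the textFile RDDs, calling union, and the first action):

// Crude timing helper.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

val dirs = Seq(
  "gs://my-bucket/20141205/csv/a/*/*",
  "gs://my-bucket/20141205/csv/b/*/*")

val rdds     = time("defining textFile RDDs") { dirs.map(sc.textFile(_)) }
val combined = time("union")                  { sc.union(rdds) }
val count    = time("first action (count)")   { combined.count() }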

On Wed, Dec 17, 2014 at 11:24 PM, Alessandro Baretta alexbare...@gmail.com
wrote:

 Well, what do you suggest I run to test this? But more importantly, what
 information would this give me?

 On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote:

 Oh, it makes sense that gsutil scans through this quickly, but I was
 wondering if running a Hadoop job / bdutil would result in just as fast
 scans?


 On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 Denny,

 No, gsutil scans through the listing of the bucket quickly. See the
 following.

 alex@hadoop-m:~/split$ time bash -c "gsutil ls
 gs://my-bucket/20141205/csv/*/*/* | wc -l"

 6860

 real    0m6.971s
 user    0m1.052s
 sys     0m0.096s

 Alex


 On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com
 wrote:

 I'm curious if you're seeing the same thing when using bdutil against
 GCS?  I'm wondering if this may be an issue concerning the transfer rate of
 Spark -> Hadoop -> GCS Connector -> GCS.


 On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta 
 alexbare...@gmail.com wrote:

 All,

 I'm using the Spark shell to interact with a small test deployment of
 Spark, built from the current master branch. I'm processing a dataset
 comprising a few thousand objects on Google Cloud Storage, split into a
 half dozen directories. My code constructs an object--let me call it the
 Dataset object--that defines a distinct RDD for each directory. The
 constructor of the object only defines the RDDs; it does not actually
 evaluate them, so I would expect it to return very quickly. Indeed, the
 logging code in the constructor prints a line signaling the completion of
 the code almost immediately after invocation, but the Spark shell does not
 show the prompt right away. Instead, it spends a few minutes seemingly
 frozen, eventually producing the following output:

 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
 process : 9

 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
 process : 759

 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
 process : 228

 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
 process : 3076

 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
 process : 1013

 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
 process : 156

 This stage is inexplicably slow. What could be happening?

 Thanks.


 Alex




Re: Still struggling with building documentation

2014-11-11 Thread Alessandro Baretta
Nicholas and Patrick,

Thanks for your help, but, no, it still does not work. The latest master
produces the following scaladoc errors:

[error]
/home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55:
not found: type Type
[error]   protected Type type() { return Type.UPLOAD_BLOCK; }
[error] ^
[error]
/home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39:
not found: type Type
[error]   protected Type type() { return Type.STREAM_HANDLE; }
[error] ^
[error]
/home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40:
not found: type Type
[error]   protected Type type() { return Type.OPEN_BLOCKS; }
[error] ^
[error]
/home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44:
not found: type Type
[error]   protected Type type() { return Type.REGISTER_EXECUTOR; }
[error] ^

...

[error] four errors found
[error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
[error] (spark/scalaunidoc:doc) Scaladoc generation failed
[error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM
Moving back into docs dir.
Making directory api/scala
cp -r ../target/scala-2.10/unidoc/. api/scala
Making directory api/java
cp -r ../target/javaunidoc/. api/java
Moving to python/docs directory and building sphinx.
Makefile:14: *** The 'sphinx-build' command was not found. Make sure you
have Sphinx installed, then set the SPHINXBUILD environment variable to
point to the full path of the 'sphinx-build' executable. Alternatively you
can add the directory with the executable to your PATH. If you don't have
Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.

Moving back into home dir.
Making directory api/python
cp -r python/docs/_build/html/. docs/api/python
/usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory
- python/docs/_build/html/. (Errno::ENOENT)
from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0'
from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r'
from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `top
(required)'
from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize'
from /usr/bin/jekyll:224:in `new'
from /usr/bin/jekyll:224:in `main'

What next?

Alex




On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I believe the web docs need to be built separately according to the
 instructions here
 https://github.com/apache/spark/blob/master/docs/README.md.

 Did you give those a shot?

 It's annoying to have a separate thing with new dependencies in order to
 build the web docs, but that's how it is at the moment.

 Nick

 On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com
 wrote:

 I finally came to realize that there is a special Maven target to build
 the scaladocs, although arguably a very unintuitive one: mvn verify. So now
 I have scaladocs for each package, but not for the whole Spark project.
 Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
 build/docs/api directory referenced in api.html is missing. How do I build
 it?

 Alex Baretta





Still struggling with building documentation

2014-11-07 Thread Alessandro Baretta
I finally came to realize that there is a special Maven target to build the
scaladocs, although arguably a very unintuitive one: mvn verify. So now I
have scaladocs for each package, but not for the whole Spark project.
Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
build/docs/api directory referenced in api.html is missing. How do I build
it?

Alex Baretta


Scaladoc

2014-10-30 Thread Alessandro Baretta
How do I build the scaladoc HTML files from the Spark source distribution?

Alex Baretta