SizeEstimator in Spark 1.1 and high load/object allocation when reading in data

2014-10-30 Thread Erik Freed
Hi All,

We have recently moved from Spark 0.9 to 1.1 for an application that handles
a fair number of very large datasets partitioned across multiple nodes.
Roughly half of each of these datasets is stored in off-heap byte arrays and
the other half on the standard Java heap.
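
For context, the off-heap half looks roughly like the sketch below (our
implementation is custom, so the direct ByteBuffer here is only an
illustration of the layout, not our actual code):

    // Data held outside the GC heap in native memory; only the small
    // ByteBuffer handle itself lives on the Java heap.
    val offHeapChunk = java.nio.ByteBuffer.allocateDirect(64 * 1024 * 1024)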

While these datasets are being loaded from our custom HDFS 2.3 RDD, and
before we have used even a fraction of the available Java heap or native
off-heap memory, the loading slows to an absolute crawl. Profiling the Spark
executor makes it clear that the Spark SizeEstimator is demanding an
extremely high CPU load along with a fast and furious allocation of Object[]
instances. We do not believe we saw this sort of behavior in 0.9, and we
have noticed rather significant changes in this part of the BlockManager
code between 0.9 and 1.1 and beyond. A GC run reclaims all of the Object[]
instances.
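
If the hot path really is SizeEstimator sampling our deserialized values
while blocks are unrolled, one workaround we are considering (a sketch only,
not yet verified against the 1.1 unroll path) is to cache the on-heap half
in serialized form, since a serialized block can be sized from its byte
buffer rather than by reflectively walking the object graph:

    import org.apache.spark.storage.StorageLevel

    // 'records' stands in for our custom HDFS RDD. With a serialized
    // storage level the MemoryStore can size each block from its bytes
    // instead of running SizeEstimator over the deserialized objects.
    val cached = records.persist(StorageLevel.MEMORY_ONLY_SER)
    cached.count() // force materialization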

Before we spend a large amount of time either switching back to 0.9 or
tracing this further toward its root cause, I was wondering whether anyone
out there has enough experience with that part of the code (or has run into
the same problem) to help us understand what sort of root causes might lie
behind this strange behavior, and better yet, what we could do to resolve
them.

Any help would be very much appreciated.

cheers,
Erik


Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Thanks all for the update. I have actually built with those options every
which way I can think of, so perhaps the problem is in how I upload the jar
to our Artifactory repo server. Does anyone have a working pom file for
publishing a Spark 0.9 / Hadoop 2.x build to a Maven repo server?
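
In the meantime, since sbt can also publish Maven-style, here is the minimal
build.sbt sketch I have been testing (the repository URL and credentials
path below are placeholders, not our real Artifactory setup):

    // Publish the locally built jar plus a generated pom to an internal
    // Maven repository via 'sbt publish'.
    publishMavenStyle := true
    publishTo := Some("internal" at
      "https://artifactory.example.com/artifactory/libs-release-local")
    credentials += Credentials(Path.userHome / ".ivy2" / ".credentials")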

cheers,
Erik


On Fri, Apr 4, 2014 at 7:54 AM, Rahul Singhal wrote:

>   Hi Erik,
>
>  I am working with TOT branch-0.9 (> 0.9.1) and the following works for
> me for maven build:
>
>  export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M
> -XX:ReservedCodeCacheSize=512m"
> mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean
> package
>
>
>  And from http://spark.apache.org/docs/latest/running-on-yarn.html, for
> sbt build, you could try:
>
>  SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
>
>   Thanks,
> Rahul Singhal



-- 
Erik James Freed
CoDecision Software
510.859.3360
erikjfr...@codecision.com

1480 Olympus Avenue
Berkeley, CA
94708

179 Maria Lane
Orcas, WA
98245


Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Hi All,

I am not sure if this is a 0.9.0 problem that will be fixed in 0.9.1, and so
is perhaps already being addressed, but I am having a devil of a time with a
Spark 0.9.0 client jar for Hadoop 2.x. If I go to the site and download:


   - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror
     <http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-hadoop2.tgz>
     or direct file download
     <http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz>
I get a jar that appears to contain Hadoop 1.0.4 and fails when used with
Hadoop 2.3.0. I have tried repeatedly to build the source tree with the
correct options per the documentation, but I always seem to end up with
Hadoop 1.0.4. As far as I can tell, the jar available on the web site lacks
the correct Hadoop client because the build itself is having that same
problem.
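
One sanity check that narrows this down (assuming spark-shell is launched
against the assembly under test) is to ask the Hadoop client on the
classpath for its own version:

    import org.apache.hadoop.util.VersionInfo

    // Prints the version of the hadoop-client actually bundled in the
    // assembly; this is where 1.0.4 shows up for me instead of 2.3.0.
    println(VersionInfo.getVersion)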

I am about to try to troubleshoot the build, but I wanted to see first
whether anyone out there has encountered the same problem and/or whether I
am just doing something dumb (!)


Is anyone else using Hadoop 2.x? If so, how do you get the right client jar?

cheers,
Erik

-- 
Erik James Freed
CoDecision Software
510.859.3360
erikjfr...@codecision.com

1480 Olympus Avenue
Berkeley, CA
94708

179 Maria Lane
Orcas, WA
98245