the SVD of the input matrix to the first; EOF is another name for
> PCA).
>
> This takes about 30 minutes to compute the top 20 PCs of a 46.7K-by-6.3M
> dense matrix of doubles (~2 TB), with most of the time spent on the
> distributed matrix-vector multiplies.
>
> Best,
>
Any suggestion/opinion?
On 12-Jan-2016 2:06 pm, "Bharath Ravi Kumar" <reachb...@gmail.com> wrote:
We're running PCA (selecting 100 principal components) on a dataset that
has ~29K columns and is 70G in size, stored in ~600 parts on HDFS. The
matrix in question is mostly sparse, with tens of columns populated in most
rows but a few rows with thousands of columns populated. We're running
Spark on
higher
> level tool that can run your spark jobs through one mesos framework and
> then you can let spark distribute the resources more effectively.
>
> I hope that helps!
>
> Tom.
>
> On 17 Oct 2015, at 06:47, Bharath Ravi Kumar <reachb...@gmail.com> wrote:
>
To be precise, the MesosExecutorBackend's Xms & Xmx equal
spark.executor.memory. So there's no question of expanding or contracting
the memory held by the executor.
On Sat, Oct 17, 2015 at 5:38 PM, Bharath Ravi Kumar <reachb...@gmail.com>
wrote:
> David, Tom,
>
> Thanks
Can someone respond if you're aware of the reason for such a memory
footprint? It seems unintuitive and hard to reason about.
Thanks,
Bharath
On Thu, Oct 15, 2015 at 12:29 PM, Bharath Ravi Kumar <reachb...@gmail.com>
wrote:
Resending since user@mesos bounced earlier. My apologies.
On Thu, Oct 15, 2015 at 12:19 PM, Bharath Ravi Kumar <reachb...@gmail.com>
wrote:
(Reviving this thread since I ran into similar issues...)
I'm running two spark jobs (in mesos fine grained mode), each belonging to
a different mesos role, say low and high. The low:high mesos weights are
1:10. On expected lines, I see that the low priority job occupies cluster
resources to the
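For readers unfamiliar with Mesos role weights, a simplified model of what a 1:10 weighting implies when both roles are saturating the cluster (this is plain arithmetic, not the Mesos allocator itself):

```python
# Hedged sketch: under weighted fair sharing, two roles that both want the
# whole cluster converge to shares proportional to their weights.
weights = {"low": 1, "high": 10}
total = sum(weights.values())
shares = {role: w / total for role, w in weights.items()}

assert abs(shares["high"] - 10 / 11) < 1e-12  # ~91% to the high role
assert abs(shares["low"] - 1 / 11) < 1e-12    # ~9% to the low role
```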
://spark.apache.org/docs/latest/running-on-yarn.html
Then I can see exactly what's in the directory.
Doug
ps Sorry for the dup message Bharath and Todd, used wrong email address.
On Mar 19, 2015, at 1:19 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Thanks for clarifying Todd. This may
but that was for a Cloudera
installation. I am not sure what the HDP version would be to put here.
-Todd
On Wed, Mar 18, 2015 at 12:49 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi Todd,
Yes, those entries were present in the conf under the same SPARK_HOME
that was used to run spark-submit
/conf/spark-defaults.conf
file?
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
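The same two properties can equivalently be passed on the spark-submit command line rather than in spark-defaults.conf (the application class and jar name below are placeholders, and the hdp.version string is the one quoted in the thread; substitute your own HDP build's version):

```shell
spark-submit \
  --master yarn \
  --conf spark.driver.extraJavaOptions=-Dhdp.version=2.2.0.0-2041 \
  --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=2.2.0.0-2041 \
  --class com.example.MyApp \
  myapp.jar
```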
On Tue, Mar 17, 2015 at 1:04 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi,
Trying to run Spark (1.2.1, built for HDP 2.2) against a YARN cluster
results in the AM failing to start with the following error on stderr:
Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher
An application id was assigned to the job, but there were no logs.
2a
and 2b are not required.
HTH
-Todd
On Mon, Mar 16, 2015 at 10:13 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Still no luck running purpose-built 1.3 against HDP 2.2 after following all
the instructions. Anyone else faced this issue?
On Mon, Mar 16, 2015 at 8:53 PM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi Todd,
Thanks for the help. I'll try again after building a distribution with the
1.3
Ok. We'll try using it in a test cluster running 1.2.
On 16-Dec-2014 1:36 am, Xiangrui Meng men...@gmail.com wrote:
Unfortunately, it will depend on the Sorter API in 1.2. -Xiangrui
On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar
reachb...@gmail.com wrote:
Hi Xiangrui,
The block size
On Wed, Dec 3, 2014 at 10:10 PM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Thanks Xiangrui. I'll try out setting a smaller number of item blocks. And
yes, I've been following the JIRA for the new ALS implementation. I'll try
it out when it's ready for testing.
On Wed, Dec 3, 2014 at 4
will try to implement in 1.3. I'll ping you when it is ready.
Best,
Xiangrui
On Tue, Dec 2, 2014 at 10:40 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Yes, the issue appears to be due to the 2GB block size limitation. I am
hence looking for (user, product) block sizing suggestions
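A hedged back-of-envelope for the block-sizing question: Spark at the time capped any single block at 2 GB, so the number of user/product blocks must be large enough that no block's factor data exceeds that. The user count and rank below are the figures quoted in this thread; the 8-bytes-per-double arithmetic is an illustrative simplification (real blocks also carry ratings and indices):

```python
import math

# Thread's figures: 150M users, rank-10 factors stored as doubles.
users = 150_000_000
rank = 10
bytes_per_factor = rank * 8          # one rank-10 double vector per user
limit = 2 * 1024**3                  # the 2 GB block-size cap

# Minimum number of user blocks so no single block's factors exceed the cap.
min_user_blocks = math.ceil(users * bytes_per_factor / limit)
assert min_user_blocks == 6
```

In practice one would use far more blocks than this lower bound, since ratings and serialization overhead dominate the factor vectors themselves.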
a very similar use case to yours (with more constrained hardware resources)
and I haven’t seen this exact problem, but I’m sure we’ve seen similar
issues. Please let me know if you have other questions.
From: Bharath Ravi Kumar reachb...@gmail.com
Date: Thursday, November 27, 2014 at 1:30 PM
Thanks,
Bharath
On Fri, Nov 28, 2014 at 12:00 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
We're training a recommender with ALS in mllib 1.1 against a dataset of
150M users and 4.5K items, with the total number of training records being
1.2 Billion (~30GB data). The input data is spread across 1200 partitions
on HDFS. For the training, rank=10, and we've configured {number of user
data
save every element of the RDD as one line of text.
It works like TextOutputFormat in Hadoop MapReduce since that's what
it uses. So you are causing it to create one big string out of each
Iterable this way.
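The failure mode described above can be shown without Spark: saveAsTextFile stringifies each RDD element, so each (key, Iterable) pair produced by groupBy becomes one line containing the entire group. A plain-Python sketch (toy data, illustrative only):

```python
# Simulate groupBy followed by "one line of text per element".
records = [("k1", i) for i in range(5)] + [("k2", i) for i in range(3)]

grouped = {}
for k, v in records:
    grouped.setdefault(k, []).append(v)

# What saveAsTextFile would emit after groupBy: one (possibly huge) line
# per key, because the whole Iterable is turned into a single string.
lines_grouped = [f"({k}, {vals})" for k, vals in grouped.items()]
assert len(lines_grouped) == 2   # only as many lines as keys

# Writing one element per line instead (e.g. flatMapping the values back
# out before saving) avoids building one giant string per key.
lines_flat = [f"({k}, {v})" for k, vals in grouped.items() for v in vals]
assert len(lines_flat) == 8
```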
On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Thanks
approach. My bad.
On Mon, Nov 3, 2014 at 3:38 PM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
The result was no different with saveAsHadoopFile. In both cases, I can
see that I've misinterpreted the API docs. I'll explore the APIs a bit
further for ways to save the iterable as chunks rather than
attempting to create a huge array, for example, when the number of elements
in the array are computed using an algorithm that computes an incorrect
size.”
On 2 Nov, 2014, at 12:25 pm, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi,
I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of
count ~100 million. The data size is 20GB, and groupBy results in an RDD of
1061 keys with values being Iterable<Tuple4<String, Integer, Double,
String>>. The job runs on 3 hosts in a standalone setup with each host's
Minor clarification: I'm running spark 1.1.0 on JDK 1.8, Linux 64 bit.
On Sun, Nov 2, 2014 at 1:06 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Resurfacing the thread. Oom shouldn't be the norm for a common groupby /
sort use case in a framework that is leading in sorting bench marks? Or is
there something fundamentally wrong in the usage?
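One common answer to the question above: when the downstream computation is a per-key reduction, groupBy (which materializes every value of a key at once, and can OOM on skewed keys) can be replaced by an incremental per-key reduce, which is what reduceByKey/aggregateByKey do on the cluster. A Spark-free sketch with toy data (note this does not cover the thread's full "write all values in sorted order" case, only the per-key-aggregate pattern):

```python
from collections import defaultdict

records = [("a", 3), ("b", 1), ("a", 2), ("b", 5), ("a", 4)]

# groupBy-style: collects every value for a key before reducing (can OOM
# when one key holds a large share of the data).
groups = defaultdict(list)
for k, v in records:
    groups[k].append(v)
maxes_grouped = {k: max(vs) for k, vs in groups.items()}

# reduceByKey-style: one running value per key, constant memory per key.
maxes_reduced = {}
for k, v in records:
    maxes_reduced[k] = v if k not in maxes_reduced else max(maxes_reduced[k], v)

assert maxes_grouped == maxes_reduced == {"a": 4, "b": 5}
```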
On 02-Nov-2014 1:06 am, Bharath Ravi Kumar reachb...@gmail.com wrote:
Hi,
I'm trying to run
Update: as expected, switching to Kryo merely delays the inevitable. Does
anyone have experience controlling memory consumption while processing
(e.g. writing out) imbalanced partitions?
On 09-Aug-2014 10:41 am, Bharath Ravi Kumar reachb...@gmail.com wrote:
Our prototype application reads a 20GB
Our prototype application reads a 20GB dataset from HDFS (nearly 180
partitions), groups it by key, sorts by rank, and writes out to HDFS in that
order. The job runs against two nodes (16G, 24 cores per node available to
the job). I noticed that the execution plan results in two sortByKey
stages,
I'm looking to select the top n records (by rank) from a data set of a few
hundred GBs. My understanding is that JavaRDD.top(n, comparator) is
entirely a driver-side operation in that all records are sorted in the
driver's memory. I prefer an approach where the records are sorted on the
cluster
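The cluster-side alternative being asked for can be sketched without Spark: keep a bounded selection of the n best records per partition, then merge only those candidates, so no single machine ever holds more than O(partitions × n) records. Toy data below is illustrative:

```python
import heapq

def top_n(partition, n):
    """Best n elements of one partition, largest first (bounded memory)."""
    return heapq.nlargest(n, partition)

# Three hypothetical partitions of ranks.
partitions = [[5, 1, 9], [7, 3, 8], [2, 6, 4]]
n = 3

# Per-partition top-n, then a final merge over the small candidate set.
candidates = [x for part in partitions for x in top_n(part, n)]
result = top_n(candidates, n)
assert result == [9, 8, 7]
```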
Any suggestions to work around this issue? The pre-built Spark binaries
don't appear to work against CDH as documented, unless there's a build
issue, which seems unlikely.
On 25-Jul-2014 3:42 pm, Bharath Ravi Kumar reachb...@gmail.com wrote:
I'm encountering a hadoop client protocol mismatch
to your build in your app?
On Fri, Jul 25, 2014 at 4:32 PM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
custom Spark and depending on it is a different thing from depending
on plain Spark and changing its deps. I think you want the latter.
On Fri, Jul 25, 2014 at 5:46 PM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Thanks for responding. I used the pre built spark binaries meant
PROCESS_LOCAL slave2 2014/07/02 16:01:28 33 s 99 ms
Any pointers / diagnosis please?
On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Thanks. I'll await the fix to re-run my test.
On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng men...@gmail.com
, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Couple more points:
1) The inexplicable stalling of execution with large feature sets appears
similar to that reported with the news-20 dataset:
http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3c53a03542.1010...@gmail.com%3E
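For reference, the core iteration that LogisticRegressionWithSGD performs (here as full-batch gradient descent, which SGD approximates with sampled mini-batches), sketched with NumPy on toy data; the sizes, seed, learning rate, and iteration count are all illustrative assumptions, not the thread's configuration:

```python
import numpy as np

# Toy linearly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (X @ true_w > 0).astype(float)

w = np.zeros(5)
lr = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
    w -= lr * X.T @ (p - y) / len(y)     # gradient step on logistic loss

preds = (X @ w > 0).astype(float)
assert (preds == y).mean() > 0.9         # fits the separable training set
```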
Hi,
(Apologies for the long mail, but it's necessary to provide sufficient
details considering the number of issues faced.)
I'm running into issues testing LogisticRegressionWithSGD on a two-node
cluster (each node with 24 cores and 16G available to slaves, out of 24G on
the system). Here's a
Hi Xiangrui ,
I'm using 1.0.0.
Thanks,
Bharath
On 18-Jun-2014 1:43 am, Xiangrui Meng men...@gmail.com wrote:
Hi Bharath,
Thanks for posting the details! Which Spark version are you using?
Best,
Xiangrui
On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote
, Long, Integer,
Integer> into a JavaPairRDD<Tuple2<Long, Long>, Tuple2<Integer, Integer>> is
unrelated to mllib.
Thanks,
Bharath
On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi,
I'm running the Spark server with a single worker on a laptop using the
docker images. The spark-shell examples run fine with this setup. However,
when a standalone Java client tries to run wordcount on a local file (1 MB
in size), the execution fails with the following error on stdout
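For context, the computation the client attempts is the standard word count; a Spark-free sketch of the same logic (the real job would express this with flatMap, mapToPair, and reduceByKey):

```python
from collections import Counter

# Toy input standing in for the 1 MB local file.
text = "to be or not to be"

# Split into words and count occurrences of each.
counts = Counter(text.split())
assert counts["to"] == 2 and counts["be"] == 2 and counts["or"] == 1
```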