cartesian on pyspark not parallelised

2014-12-05 Thread Antony Mayi
Hi, using pyspark 1.1.0 on YARN 2.5.0. all operations run nicely in parallel - I can see multiple python processes spawned on each nodemanager, but for some reason when running cartesian there is only a single python process running on each node. the task is indicating thousands of partitions
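For context, a minimal sketch of the kind of job described (all sizes invented); one thing worth trying - not a fix confirmed in this thread - is an explicit repartition() after cartesian() so the product is evaluated by many smaller tasks:

from pyspark import SparkContext

sc = SparkContext(appName="cartesian-test")
a = sc.parallelize(range(10000), 100)
b = sc.parallelize(range(1000), 10)

# cartesian() alone yields 100 * 10 = 1000 partitions; if each node still only
# runs one python worker, reshuffling the result can spread the evaluation out.
pairs = a.cartesian(b).repartition(sc.defaultParallelism * 4)
print(pairs.count())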

Re: custom python converter from HBase Result to tuple

2014-12-22 Thread Antony Mayi
? thanks,Antony. On Monday, 22 December 2014, 20:09, Ted Yu yuzhih...@gmail.com wrote: Which HBase version are you using ? Can you show the full stack trace ? Cheers On Mon, Dec 22, 2014 at 11:02 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, can anyone please give me

saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
Hi, have been using this without any issues with spark 1.1.0 but after upgrading to 1.2.0 saving an RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing with the example from the stock hbase_outputformat.py. anyone having the same issue? (and able to solve?)
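For reference, this is roughly what the stock hbase_outputformat.py example does (quorum host and table name are placeholders); the hang reported here occurs inside this call, not because of anything unusual in the code:

from pyspark import SparkContext

sc = SparkContext(appName="hbase-write-test")
conf = {"hbase.zookeeper.quorum": "localhost",
        "hbase.mapred.outputtable": "test",
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

# each record is (rowkey, [rowkey, column family, qualifier, value])
sc.parallelize([("testkey", ["testkey", "f1", "testqual", "testval"])]) \
  .saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)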

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
and pastebin it ? Thanks On Wed, Dec 24, 2014 at 4:49 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, have been using this without any issues with spark 1.1.0 but after upgrading to 1.2.0 saving a RDD from pyspark using saveAsNewAPIHadoopDataset into HBase just hangs - even when testing

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
) -         at org.apache.hadoop.util.Shell.runCommand(Shell.java:524) On Wed, Dec 24, 2014 at 3:11 PM, Antony Mayi antonym...@yahoo.com wrote: this is it (jstack of particular yarn container) - http://pastebin.com/eAdiUYKK thanks, Antony. On Wednesday, 24 December 2014, 16:34, Ted Yu yuzhih...@gmail.com

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
24, 2014 at 4:11 PM, Antony Mayi antonym...@yahoo.com wrote: I just ran it by hand from the pyspark shell. here are the steps: pyspark --jars /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar conf = {hbase.zookeeper.quorum: localhost, ... hbase.mapred.outputtable: test

Re: saveAsNewAPIHadoopDataset against hbase hanging in pyspark 1.2.0

2014-12-24 Thread Antony Mayi
column=f1:asd, timestamp=1419463092904, value=456  testkey  column=f1:testqual, timestamp=1419487275905, value=testval  2 row(s) in 0.0270 seconds On Thursday, 25 December 2014, 6:58, Antony Mayi antonym

pyspark executor PYTHONPATH

2015-01-02 Thread Antony Mayi
Hi, I am running spark 1.1.0 on yarn. I have a custom set of modules installed under the same location on each executor node and am wondering how I can pass the executors the PYTHONPATH so that they can use the modules. I've tried this: spark-env.sh: export PYTHONPATH=/tmp/test/
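One knob that applies here (a sketch, not a confirmed resolution of this thread) is spark.executorEnv.*, which sets environment variables for the executors that in turn launch the python workers; it assumes /tmp/test/ really exists on every node:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("pythonpath-test")
        .set("spark.executorEnv.PYTHONPATH", "/tmp/test/"))
sc = SparkContext(conf=conf)

def worker_sys_path(_):
    import sys
    return sys.path  # should now include /tmp/test/ on the executor side

print(sc.parallelize([0]).map(worker_sys_path).first())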

Re: pyspark executor PYTHONPATH

2015-01-02 Thread Antony Mayi
ok, I see now what's happening - the pkg.mod.test is serialized by reference and there is nothing actually trying to import pkg.mod on the executors so the reference is broken. so how can I get the pkg.mod imported on the executors? thanks,Antony. On Friday, 2 January 2015, 13:49, Antony
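A common way to get pkg.mod importable on the executors (a sketch based on the module names mentioned above, not the answer given in the thread) is to ship the package as a zip and import it inside the function that runs remotely:

from pyspark import SparkContext

sc = SparkContext(appName="pkg-import-test")
sc.addPyFile("/path/to/pkg.zip")   # zip containing pkg/__init__.py and pkg/mod.py

def use_mod(x):
    from pkg import mod            # the import happens on the executor
    return mod.test(x)             # assumes pkg.mod really defines test()

print(sc.parallelize(range(4)).map(use_mod).collect())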

HW imbalance

2015-01-26 Thread Antony Mayi
Hi, is it possible to mix hosts with (significantly) different specs within a cluster (without wasting the extra resources)? for example, having 10 nodes with 36GB RAM/10CPUs and now trying to add 3 hosts with 128GB/10CPUs - is there a way to utilize the extra memory by spark executors (as my
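YARN sizes containers against each NodeManager's own advertised resources, so one way to put the extra memory to work (a sketch of the idea with illustrative numbers, not the thread's conclusion) is to raise yarn.nodemanager.resource.memory-mb in yarn-site.xml on the 128GB hosts only:

<!-- yarn-site.xml on the 128GB hosts; the 36GB hosts keep their smaller value -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>116736</value>  <!-- ~114GB offered to containers, leaving OS headroom -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>10</value>
</property>

Whether a single application can then use differently sized executors on different hosts is a separate question - with a fixed spark.executor.memory the bigger nodes simply fit more executors.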

Re: HW imbalance

2015-01-26 Thread Antony Mayi
is uniform across the cluster or for the machine where the config file resides.) On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, is it possible to mix hosts with (significantly) different specs within a cluster (without wasting the extra resources)? for example

pyspark importing custom module

2015-02-06 Thread Antony Mayi
Hi, is there a way to use a custom python module that is available to all executors under PYTHONPATH (without a need to upload it using sc.addPyFile()) - a bit weird that this module is on all nodes yet the spark tasks can't use it (references to its objects are serialized and sent to all executors

remote Akka client disassociated - some timeout?

2015-01-16 Thread Antony Mayi
Hi, I believe this is some kind of timeout problem but can't figure out how to increase it. I am running spark 1.2.0 on yarn (all from cdh 5.3.0). I submit a python task which first loads a big RDD from hbase - I can see in the screen output all executors fire up, then no more logging output for
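For orientation, these are the knobs usually looked at for "disassociated" executors on YARN in spark 1.2 (values are illustrative, not taken from this thread); in practice the symptom is frequently an executor dying or being killed for exceeding its container rather than a genuine timeout:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.akka.timeout", "300")                    # seconds
        .set("spark.yarn.executor.memoryOverhead", "2048"))  # MB of off-heap headroom
sc = SparkContext(conf=conf)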

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-16 Thread Antony Mayi
. is there anything more I can do? thanks,Antony. On Monday, 12 January 2015, 8:21, Antony Mayi antonym...@yahoo.com wrote: this seems to have sorted it, awesome, thanks for great help.Antony. On Sunday, 11 January 2015, 13:02, Sean Owen so...@cloudera.com wrote: I would expect

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-17 Thread Antony Mayi
have a look at http://spark.apache.org/docs/latest/running-on-yarn.html ... --executor-memory 22g --conf spark.yarn.executor.memoryOverhead=2g ... should do it, off the top of my head. That should reserve 24g from YARN. On Sat, Jan 17, 2015 at 5:29 AM, Antony Mayi antonym...@yahoo.com wrote

storing MatrixFactorizationModel (pyspark)

2015-02-19 Thread Antony Mayi
Hi, when getting the model out of ALS.train it would be beneficial to store it (to disk) so the model can be reused later for any following predictions. I am using pyspark and I had no luck pickling it either using the standard pickle module or even dill. does anyone have a solution for this (note

loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
Hi, I have 4 very powerful boxes (256GB RAM, 32 cores each). I am running spark 1.2.0 in yarn-client mode with the following layout: spark.executor.cores=4 spark.executor.memory=28G spark.yarn.executor.memoryOverhead=4096 I am submitting a bigger ALS trainImplicit task (rank=100, iters=15) on a
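For reference, the per-container footprint implied by that layout (plain arithmetic, not a statement about what YARN actually granted):

executor_memory    = 28            # GB, spark.executor.memory
memory_overhead    = 4096 / 1024   # GB, spark.yarn.executor.memoryOverhead
container_size     = executor_memory + memory_overhead   # 32 GB per executor container
executors_per_box  = 256 // container_size               # at most 8 on a 256GB node
heap_per_task      = executor_memory / 4                 # ~7GB per concurrent task (4 cores)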

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
= 128GB of RAM, not 1TB. It still feels like this shouldn't be running out of memory, not by a long shot though. But just pointing out potential differences between what you are expecting and what you are configuring. On Thu, Feb 19, 2015 at 9:56 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
rather than split up your memory, but at some point it becomes counter-productive. 32GB is a fine executor size. So you have ~8GB available per task which seems like plenty. Something else is at work here. Is this error from your code's stages or ALS? On Thu, Feb 19, 2015 at 10:07 AM, Antony Mayi

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
container_1424204221358_0013_01_08 transitioned from RUNNING to EXITED_WITH_FAILURE 2015-02-19 12:08:14,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1424204221358_0013_01_08 Antony. On Thursday, 19 February 2015, 11:54, Antony Mayi

Re: spark-local dir running out of space during long ALS run

2015-02-16 Thread Antony Mayi
help you get rid of files in the workers that are not needed. TD On Mon, Feb 16, 2015 at 12:27 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: spark.cleaner.ttl is not the right way - seems to be really designed for streaming. although it keeps the disk usage under control it also causes loss

Re: storing MatrixFactorizationModel (pyspark)

2015-02-20 Thread Antony Mayi
. the matrix model had two RDD vectors representing the decomposed matrix. You can save these to disk and reuse them. On Thu, Feb 19, 2015 at 2:19 AM Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, when getting the model out of ALS.train it would be beneficial to store it (to disk) so the model
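A sketch of the approach suggested above - it assumes the PySpark wrapper in the version at hand exposes userFeatures()/productFeatures(); paths and the toy ratings are placeholders:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext(appName="als-model-store")
ratings = sc.parallelize([(0, 0, 1.0), (0, 1, 2.0), (1, 1, 1.0)])  # (user, item, rating)
model = ALS.trainImplicit(ratings, rank=10, iterations=5)

# persist the two factor RDDs making up the decomposed matrix
model.userFeatures().saveAsPickleFile("hdfs:///tmp/als/user_features")
model.productFeatures().saveAsPickleFile("hdfs:///tmp/als/product_features")

# later, in another session: reload the factors and score by taking dot products
user_features = sc.pickleFile("hdfs:///tmp/als/user_features")
product_features = sc.pickleFile("hdfs:///tmp/als/product_features")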

java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Antony Mayi
Hi, I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors crashing with this error. does that mean I genuinely don't have enough RAM or is this a matter of config tuning? other config options used: spark.storage.memoryFraction=0.3 SPARK_EXECUTOR_MEMORY=14G running spark 1.2.0 as

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Antony Mayi
Cc: Sandy Ryza sandy.r...@cloudera.com, Antony Mayi antonym...@yahoo.com, user@spark.apache.org user@spark.apache.org Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded Since it's an executor running OOM it doesn't look like a container being killed by YARN to me

pyspark and and page allocation failures due to memory fragmentation

2015-01-30 Thread Antony Mayi
Hi, When running a big mapreduce operation with pyspark (in this particular case using a lot of sets and operations on sets in the map tasks, so likely to be allocating and freeing loads of pages) I eventually get the kernel error 'python: page allocation failure: order:10, mode:0x2000d0' plus very

ALS.trainImplicit running out of mem when using higher rank

2015-01-10 Thread Antony Mayi
the memory requirements seem to be rapidly growing when using a higher rank... I am unable to get over 20 without running out of memory. is this expected? thanks, Antony. 

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-10 Thread Antony Mayi
container. thanks for any ideas,Antony. On Saturday, 10 January 2015, 10:11, Antony Mayi antonym...@yahoo.com wrote: the memory requirements seem to be rapidly growing when using a higher rank... I am unable to get over 20 without running out of memory. is this expected? thanks

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-11 Thread Antony Mayi
input and parameters? thanks,Antony. On Saturday, 10 January 2015, 10:47, Antony Mayi antonym...@yahoo.com wrote: the actual case looks like this: * spark 1.1.0 on yarn (cdh 5.2.1) * ~8-10 executors, 36GB phys RAM per host * input RDD is roughly 3GB containing ~150-200M items

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-11 Thread Antony Mayi
) You can try increasing it to a couple GB. On Sun, Jan 11, 2015 at 9:43 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: the question really is whether this is expected that the memory requirements grow rapidly with the rank... as I would expect memory is rather O(1) problem with dependency

spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Antony Mayi
Hi, I am using newAPIHadoopRDD to load an RDD from hbase (using pyspark running as yarn-client) - pretty much the standard case demonstrated in hbase_inputformat.py from the examples... the thing is that when trying the very same code on spark 1.2 I am getting the error below which based on
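For reference, this is roughly the read path from the stock hbase_inputformat.py example (quorum host and table name are placeholders); the MR1-vs-MR2 problem in this thread is a classpath issue, not something in this code:

from pyspark import SparkContext

sc = SparkContext(appName="hbase-read-test")
conf = {"hbase.zookeeper.quorum": "localhost",
        "hbase.mapreduce.inputtable": "test"}

rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)
print(rdd.take(1))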

java.io.IOException: sendMessageReliably failed without being ACK'd

2015-01-14 Thread Antony Mayi
Hi, running spark 1.1.0 in yarn-client mode (cdh 5.2.1) on a XEN-based cloud and randomly getting my executors failing on errors like the one below. I suspect it is some cloud networking issue (XEN driver bug?) but wondering if there is any spark/yarn workaround that I could use to mitigate it?
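One mitigation commonly suggested for this particular error in spark 1.1 (the value here is a guess, not advice from this thread) is to give the connection manager longer to wait for ACKs before declaring the send failed:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.core.connection.ack.wait.timeout", "600")  # seconds
sc = SparkContext(conf=conf)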

spark-local dir running out of space during long ALS run

2015-02-15 Thread Antony Mayi
Hi, I am running a bigger ALS on spark 1.2.0 on yarn (cdh 5.3.0) - the ALS is using about 3 billion ratings and I am doing several trainImplicit() runs in a loop within one spark session. I have a four-node cluster with 3TB disk space on each. before starting the job there is less than 8% of the disk

Re: spark-local dir running out of space during long ALS run

2015-02-15 Thread Antony Mayi
spark.cleaner.ttl ? On Sunday, 15 February 2015, 18:23, Antony Mayi antonym...@yahoo.com wrote: Hi, I am running bigger ALS on spark 1.2.0 on yarn (cdh 5.3.0) - ALS is using about 3 billions of ratings and I am doing several trainImplicit() runs in loop within one spark session

Re: spark-local dir running out of space during long ALS run

2015-02-16 Thread Antony Mayi
, Antony Mayi antonym...@yahoo.com wrote: spark.cleaner.ttl ? On Sunday, 15 February 2015, 18:23, Antony Mayi antonym...@yahoo.com wrote: Hi, I am running bigger ALS on spark 1.2.0 on yarn (cdh 5.3.0) - ALS is using about 3 billions of ratings and I am doing several

Re: spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Antony Mayi
experience might be able to give more useful directions about how to fix that. On Wed, Jan 7, 2015 at 1:46 PM, Antony Mayi antonym...@yahoo.com wrote: this is the official cloudera-compiled stack cdh 5.3.0 - nothing has been done by me and I presume they are pretty good at building it so I still

Re: spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Antony Mayi
/ Hadoop provided by the cluster. It could also be that you're pairing Spark compiled for Hadoop 1.x with a 2.x cluster. On Wed, Jan 7, 2015 at 9:38 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, I am using newAPIHadoopRDD to load RDD from hbase (using pyspark running as yarn-client

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Antony Mayi
19, 2015 at 5:10 AM Antony Mayi antonym...@yahoo.com.invalid wrote: now with reverted spark.shuffle.io.preferDirectBufs (to true) getting again GC overhead limit exceeded: === spark stdout ===15/02/19 12:08:08 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 18.0 (TID 5329, 192.168.1.93

shuffle data taking immense disk space during ALS

2015-02-23 Thread Antony Mayi
Hi, This has already been briefly discussed here in the past but there seem to be more questions... I am running a bigger ALS task with input data of ~40GB (~3 billion ratings). The data is partitioned into 512 partitions and I am also using default parallelism set to 512. The ALS runs with
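Not an answer from this thread, but the usual way to keep long iterative ALS lineages (and the shuffle files they pin) in check is to give the context a checkpoint directory so intermediate RDDs can be truncated and old shuffle data becomes eligible for cleanup; the paths and input are placeholders:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext(appName="als-checkpointed")
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  # lets ALS checkpoint its intermediate RDDs

ratings = sc.pickleFile("hdfs:///data/ratings")       # placeholder: RDD of (user, item, rating)
model = ALS.trainImplicit(ratings, rank=100, iterations=15)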

IF in SQL statement

2015-05-16 Thread Antony Mayi
Hi, is it expected that I can't reference a column inside of an IF statement like this: sctx.sql("SELECT name, IF(ts>0, price, 0) FROM table").collect() I get an error: org.apache.spark.sql.AnalysisException: unresolved operator 'Project [name#0,if ((CAST(ts#1, DoubleType) > CAST(0, DoubleType))) price#2
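A workaround sketch (not the list's answer): expressing the conditional as CASE WHEN, which is often accepted where IF(...) trips the analyzer; it assumes an existing SparkContext sc and the registered table from the question:

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)
rows = sqlCtx.sql(
    "SELECT name, CASE WHEN ts > 0 THEN price ELSE 0 END AS price FROM table"
).collect()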

synchronizing streams of different kafka topics

2015-11-17 Thread Antony Mayi
Hi, I have two streams coming from two different kafka topics. the two topics contain time-related events but are quite asymmetric in volume. I would obviously need to process them in sync to get the time-related events together but with the same processing rate if the heavier stream starts

imposed dynamic resource allocation

2015-12-11 Thread Antony Mayi
Hi, using spark 1.5.2 on yarn (client mode) and was trying to use the dynamic resource allocation but it seems once it is enabled by the first app then any following application is managed that way even if it is explicitly disabled. example: 1) yarn configured with 
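What explicitly disabling it for one application would look like (a sketch; the observation in this thread is that it appears to be ignored once a previous app has enabled dynamic allocation):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.executor.instances", "10"))   # i.e. a fixed number of executors
sc = SparkContext(conf=conf)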

Re: pyspark streaming crashes

2016-01-04 Thread Antony Mayi
just for reference in my case this problem is caused by this bug:  https://issues.apache.org/jira/browse/SPARK-12617 On Monday, 21 December 2015, 14:32, Antony Mayi <antonym...@yahoo.com> wrote: I noticed it might be related to longer GC pauses (1-2 sec) - the crash usually

pyspark streaming crashes

2015-12-20 Thread Antony Mayi
Hi, can anyone please help me troubleshoot this problem: I have a streaming pyspark application (spark 1.5.2 on yarn-client) which keeps crashing after a few hours. It doesn't seem to be running out of memory on either the driver or the executors. driver error: py4j.protocol.Py4JJavaError: An error occurred

Re: pyspark streaming crashes

2015-12-21 Thread Antony Mayi
I noticed it might be related to longer GC pauses (1-2 sec) - the crash usually occurs after such a pause. could that be causing the python-java gateway to time out? On Sunday, 20 December 2015, 23:05, Antony Mayi <antonym...@yahoo.com> wrote: Hi, can anyone please h

driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Antony Mayi
I have a streaming app (pyspark 1.5.2 on yarn) that's crashing due to a driver (jvm part, not python) OOM (no matter how big a heap is assigned, it eventually runs out). When checking the heap it is all taken by "byte" items of io.netty.buffer.PoolThreadCache. The number of

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Antony Mayi
-of-io-netty-buffer-poolthreadcachememoryregioncacheentry-instances On Tue, Dec 22, 2015 at 2:59 AM, Antony Mayi <antonym...@yahoo.com.invalid> wrote: I have streaming app (pyspark 1.5.2 on yarn) that's crashing due to driver (jvm part, not python) OOM (no matter how big heap is assigned

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-23 Thread Antony Mayi
fyi after further troubleshooting logging this as  https://issues.apache.org/jira/browse/SPARK-12511 On Tuesday, 22 December 2015, 18:16, Antony Mayi <antonym...@yahoo.com> wrote: I narrowed it down to problem described for example here:  https://bugs.openjdk.java.net/brow