[GraphX] - OOM Java Heap Space

2018-10-28 Thread Thodoris Zois
Hello,

I have the edges of a graph stored as parquet files (about 3GB). I am loading 
the graph and trying to compute the total number of triplets and triangles. 
Here is my code:

import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.rdd.RDD

// Load the edge list from Parquet and build the graph
val edges_parq = sqlContext.read.option("header", "true")
  .parquet(args(0) + "/year=" + year)
val edges: RDD[Edge[Int]] = edges_parq.rdd.map(row =>
  Edge(row(0).asInstanceOf[Int], row(1).asInstanceOf[Int]))
val graph = Graph.fromEdges(edges, 1)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// The actual computation
val numberOfTriplets = graph.triplets.count
val tmp = graph.triangleCount().vertices.filter { case (vid, count) => count > 0 }
val numberOfTriangles = tmp.map(_._2).sum()

Even though it manages to compute the number of triplets, it never gets through the 
triangle count. Every time, some executors fail with an OOM (Java Heap Space) 
exception and the application dies.
I am using 100 executors (1 core and 6 GB per executor). I have also tried setting 
'hdfsConf.set("mapreduce.input.fileinputformat.split.maxsize", "33554432")' in the 
code, but it made no difference.

Here are some of my configurations:
--conf spark.driver.memory=20G 
--conf spark.driver.maxResultSize=20G 
--conf spark.yarn.executor.memoryOverhead=6144 
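
For completeness, here is a variant of the graph construction that keeps edges and 
vertices at MEMORY_AND_DISK so the triangle-count stages can spill to disk instead of 
exhausting the heap. This is only a sketch; the storage-level choice is an assumption 
on my side, not something I have verified fixes the OOM:

import org.apache.spark.graphx.{Graph, PartitionStrategy}
import org.apache.spark.storage.StorageLevel

// Same construction as above, but let GraphX spill edges and vertices to disk.
val spillableGraph = Graph.fromEdges(
    edges,
    1,                               // default vertex attribute
    StorageLevel.MEMORY_AND_DISK,    // edge storage level
    StorageLevel.MEMORY_AND_DISK)    // vertex storage level
  .partitionBy(PartitionStrategy.RandomVertexCut)

val triangles = spillableGraph.triangleCount().vertices
  .filter { case (_, count) => count > 0 }
  .map(_._2)
  .sum()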

- Thodoris



[Spark-GraphX] Conductance, Bridge Ratio & Diameter

2018-10-18 Thread Thodoris Zois
Hello, 

I am trying to compute the conductance, bridge ratio and diameter of a given graph, 
but I am facing some problems.

- For the conductance, my problem is how to compute the cuts so that each side of the 
cut is at least roughly clustered. Is partitionBy from GraphX related to dividing a 
graph into multiple subgraphs that are clustered or semi-clustered? If not, could you 
please point me to some help (an open-source implementation or anything else I can 
build on)? A sketch of how I evaluate the conductance of a given cut follows below, 
after this list.

- For the bridge ratio I have an implementation, but it is very naive and takes a 
long time to finish. If anybody can provide some help I would be really grateful.

- For the diameter I found https://github.com/Cecca/graphx-diameter, but after a 
certain point, even though the graph is only 300 MB, it hangs at a reduce stage and 
keeps "running" for 14 hours. After a while I sometimes get "SparkListenerBus has 
stopped". Any ideas?
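
Regarding the conductance item above: for a cut given as a vertex set S, I take 
conductance as (number of edges crossing the cut) / min(vol(S), vol(V \ S)), with vol 
being the sum of degrees. A minimal GraphX sketch — the conductance helper and the 
Set[VertexId] representation of the cut are my own assumptions, and for a large cut 
the set should be broadcast:

import org.apache.spark.graphx.{Graph, VertexId}

// Conductance of the cut (S, V \ S), with S given as a set of vertex ids.
def conductance(graph: Graph[Int, Int], side: Set[VertexId]): Double = {
  // Edges crossing the cut: exactly one endpoint lies inside `side`.
  val crossing = graph.triplets
    .filter(t => side.contains(t.srcId) ^ side.contains(t.dstId))
    .count()
    .toDouble

  // Volume of each side = sum of the degrees of its vertices.
  val (volS, volRest) = graph.degrees
    .map { case (vid, deg) =>
      if (side.contains(vid)) (deg.toLong, 0L) else (0L, deg.toLong)
    }
    .reduce((a, b) => (a._1 + b._1, a._2 + b._2))

  crossing / math.min(volS, volRest).toDouble
}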

Thank you very much for your help!
- Thodoris 

Re: Spark on Mesos - Weird behavior

2018-07-23 Thread Thodoris Zois
Hi Susan,

This is exactly what we have used. Thank you for your interest!
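
For the record, a minimal sketch of the configuration this thread converged on. The 
executor size and max cores are the values from my original post; the ratio value of 
1.0 is illustrative, not the exact value we run with:

import org.apache.spark.SparkConf

// Mesos coarse-grained mode: spark.cores.max / spark.executor.cores bounds the
// number of executors, and minRegisteredResourcesRatio makes the scheduler wait
// until that fraction of spark.cores.max has registered before scheduling begins.
val conf = new SparkConf()
  .set("spark.executor.cores", "10")
  .set("spark.executor.memory", "2g")
  .set("spark.cores.max", "30")
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")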

- Thodoris 

> On 23 Jul 2018, at 20:55, Susan X. Huynh  wrote:
> 
> Hi Thodoris,
> 
> Maybe setting "spark.scheduler.minRegisteredResourcesRatio" to > 0 would 
> help? Default value is 0 with Mesos.
> 
> "The minimum ratio of registered resources (registered resources / total 
> expected resources) (resources are executors in yarn mode and Kubernetes 
> mode, CPU cores in standalone mode and Mesos coarsed-grained mode 
> ['spark.cores.max' value is total expected resources for Mesos coarse-grained 
> mode] ) to wait for before scheduling begins. Specified as a double between 
> 0.0 and 1.0. Regardless of whether the minimum ratio of resources has been 
> reached, the maximum amount of time it will wait before scheduling begins is 
> controlled by configspark.scheduler.maxRegisteredResourcesWaitingTime." - 
> https://spark.apache.org/docs/latest/configuration.html
> 
> Susan
> 
>> On Wed, Jul 11, 2018 at 7:22 AM, Pavel Plotnikov 
>>  wrote:
>> Oh, sorry, I missed that you use Spark without dynamic allocation. Anyway, I 
>> don't know whether these parameters work without dynamic allocation. 
>> 
>>> On Wed, Jul 11, 2018 at 5:11 PM Thodoris Zois  wrote:
>>> Hello,
>>> 
>>> Yeah you are right, but I think that works only if you use Spark dynamic 
>>> allocation. Am I wrong?
>>> 
>>> -Thodoris
>>> 
>>>> On 11 Jul 2018, at 17:09, Pavel Plotnikov  
>>>> wrote:
>>>> 
>>>> Hi, Thodoris
>>>> You can configure the resources per executor and control the number of 
>>>> executors instead of using spark.cores.max. I think the 
>>>> spark.dynamicAllocation.minExecutors and 
>>>> spark.dynamicAllocation.maxExecutors configuration values can help you.
>>>> 
>>>>> On Tue, Jul 10, 2018 at 5:07 PM Thodoris Zois  wrote:
>>>>> Actually, after some experiments we figured out that spark.cores.max / 
>>>>> spark.executor.cores is the upper bound on the number of executors. Spark 
>>>>> apps will run even if only one executor can be launched. 
>>>>> 
>>>>> Is there any way to also specify a lower bound? It is a bit annoying that 
>>>>> we seemingly can't control the resource usage of an application. By the 
>>>>> way, we are not using dynamic allocation. 
>>>>> 
>>>>> - Thodoris 
>>>>> 
>>>>> 
>>>>>> On 10 Jul 2018, at 14:35, Pavel Plotnikov 
>>>>>>  wrote:
>>>>>> 
>>>>>> Hello Thodoris!
>>>>>> Have you checked the following:
>>>>>>  - does the Mesos cluster have available resources?
>>>>>>  - does Spark have tasks waiting in the queue for longer than the 
>>>>>> spark.dynamicAllocation.schedulerBacklogTimeout configuration value?
>>>>>>  - and have you checked that Mesos sends offers with at least 10 cores 
>>>>>> and 2 GB RAM to the Spark framework?
>>>>>> 
>>>>>> If Mesos has no offers with 10 cores available but does have offers with 
>>>>>> 8 or 9, you can use smaller executors (for example 4 cores and 1 GB RAM) 
>>>>>> to better fit the resources available on the nodes.
>>>>>> 
>>>>>> Cheers,
>>>>>> Pavel
>>>>>> 
>>>>>>> On Mon, Jul 9, 2018 at 9:05 PM Thodoris Zois  wrote:
>>>>>>> Hello list,
>>>>>>> 
>>>>>>> We are running Apache Spark on a Mesos cluster and we see some weird 
>>>>>>> executor behavior. When we submit an app with e.g. 10 cores and 2 GB of 
>>>>>>> memory per executor and max cores 30, we expect to see 3 executors 
>>>>>>> running on the cluster. However, sometimes there are only 2... Spark 
>>>>>>> applications are not the only ones that run on the cluster. I guess that 
>>>>>>> Spark starts executors on the available offers even if they do not 
>>>>>>> satisfy our needs. Is there any configuration that we can use to prevent 
>>>>>>> Spark from starting when there are no resource offers for the total 
>>>>>>> number of executors?
>>>>>>> 
>>>>>>> Thank you 
>>>>>>> - Thodoris 
>>>>>>> 
>>>>>>> 
>>> 
> 
> 
> 
> -- 
> Susan X. Huynh
> Software engineer, Data Agility
> xhu...@mesosphere.com


Re: Spark on Mesos - Weird behavior

2018-07-11 Thread Thodoris Zois
Hello,

Yeah you are right, but I think that works only if you use Spark dynamic 
allocation. Am I wrong?

-Thodoris

> On 11 Jul 2018, at 17:09, Pavel Plotnikov  
> wrote:
> 
> Hi, Thodoris
> You can configure the resources per executor and control the number of 
> executors instead of using spark.cores.max. I think the 
> spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors 
> configuration values can help you.
> 
> On Tue, Jul 10, 2018 at 5:07 PM Thodoris Zois  wrote:
> Actually, after some experiments we figured out that spark.cores.max / 
> spark.executor.cores is the upper bound on the number of executors. Spark apps 
> will run even if only one executor can be launched. 
> 
> Is there any way to also specify a lower bound? It is a bit annoying that we 
> seemingly can't control the resource usage of an application. By the way, 
> we are not using dynamic allocation. 
> 
> - Thodoris 
> 
> 
> On 10 Jul 2018, at 14:35, Pavel Plotnikov  wrote:
> 
>> Hello Thodoris!
>> Have you checked the following:
>>  - does the Mesos cluster have available resources?
>>  - does Spark have tasks waiting in the queue for longer than the 
>> spark.dynamicAllocation.schedulerBacklogTimeout configuration value?
>>  - and have you checked that Mesos sends offers with at least 10 cores 
>> and 2 GB RAM to the Spark framework?
>> 
>> If Mesos has no offers with 10 cores available but does have offers with 
>> 8 or 9, you can use smaller executors (for example 4 cores and 1 GB RAM) 
>> to better fit the resources available on the nodes.
>> 
>> Cheers,
>> Pavel
>> 
>> On Mon, Jul 9, 2018 at 9:05 PM Thodoris Zois  wrote:
>> Hello list,
>> 
>> We are running Apache Spark on a Mesos cluster and we see some weird executor 
>> behavior. When we submit an app with e.g. 10 cores and 2 GB of memory per 
>> executor and max cores 30, we expect to see 3 executors running on the 
>> cluster. However, sometimes there are only 2... Spark applications are not 
>> the only ones that run on the cluster. I guess that Spark starts executors on 
>> the available offers even if they do not satisfy our needs. Is there any 
>> configuration that we can use to prevent Spark from starting when there are 
>> no resource offers for the total number of executors?
>> 
>> Thank you 
>> - Thodoris 
>> 
>> 



Re: Spark on Mesos - Weird behavior

2018-07-10 Thread Thodoris Zois
Actually, after some experiments we figured out that spark.cores.max / 
spark.executor.cores is the upper bound on the number of executors. Spark apps will 
run even if only one executor can be launched. 

Is there any way to also specify a lower bound? It is a bit annoying that we 
seemingly can't control the resource usage of an application. By the way, we are 
not using dynamic allocation. 

- Thodoris 


> On 10 Jul 2018, at 14:35, Pavel Plotnikov  
> wrote:
> 
> Hello Thodoris!
> Have you checked the following:
>  - does the Mesos cluster have available resources?
>  - does Spark have tasks waiting in the queue for longer than the 
> spark.dynamicAllocation.schedulerBacklogTimeout configuration value?
>  - and have you checked that Mesos sends offers with at least 10 cores 
> and 2 GB RAM to the Spark framework?
> 
> If Mesos has no offers with 10 cores available but does have offers with 
> 8 or 9, you can use smaller executors (for example 4 cores and 1 GB RAM) 
> to better fit the resources available on the nodes.
> 
> Cheers,
> Pavel
> 
>> On Mon, Jul 9, 2018 at 9:05 PM Thodoris Zois  wrote:
>> Hello list,
>> 
>> We are running Apache Spark on a Mesos cluster and we see some weird executor 
>> behavior. When we submit an app with e.g. 10 cores and 2 GB of memory per 
>> executor and max cores 30, we expect to see 3 executors running on the 
>> cluster. However, sometimes there are only 2... Spark applications are not 
>> the only ones that run on the cluster. I guess that Spark starts executors on 
>> the available offers even if they do not satisfy our needs. Is there any 
>> configuration that we can use to prevent Spark from starting when there are 
>> no resource offers for the total number of executors?
>> 
>> Thank you 
>> - Thodoris 
>> 
>> 


Spark on Mesos - Weird behavior

2018-07-09 Thread Thodoris Zois
Hello list,

We are running Apache Spark on a Mesos cluster and we see some weird executor 
behavior. When we submit an app with e.g. 10 cores and 2 GB of memory per executor 
and max cores 30, we expect to see 3 executors running on the cluster. However, 
sometimes there are only 2... Spark applications are not the only ones that run on 
the cluster. I guess that Spark starts executors on the available offers even if 
they do not satisfy our needs. Is there any configuration that we can use to prevent 
Spark from starting when there are no resource offers for the total number of 
executors?

Thank you 
- Thodoris 




Re: Spark 2.3 driver pod stuck in Running state — Kubernetes

2018-06-08 Thread Thodoris Zois
As far as I know from running Spark on Mesos, this is a running state and not a 
pending one. What you see is normal, but somebody please correct me if I am wrong.

The Spark driver starts up normally (running state), but when it comes to launching 
executors it cannot allocate resources for them, and it hangs...

- Thodoris

> On 8 Jun 2018, at 21:24, purna pradeep  wrote:
> 
> Hello,
> When I run spark-submit on a k8s cluster, I see the driver pod stuck in the 
> Running state, and when I pull the driver pod logs I see the messages below.
> 
> I understand that this warning might be caused by a lack of CPU/memory, but I 
> would expect the driver pod to be in the "Pending" state rather than "Running" 
> when it is not actually running.
> 
> So I had to kill the driver pod and resubmit the job.
> 
> Any suggestions?
> 
> 2018-06-08 14:38:01 WARN TaskSchedulerImpl:66 - Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources
> 
> 2018-06-08 14:38:16 WARN TaskSchedulerImpl:66 - Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources
> 
> 2018-06-08 14:38:31 WARN TaskSchedulerImpl:66 - Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources
> 
> 2018-06-08 14:38:46 WARN TaskSchedulerImpl:66 - Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources
> 
> 2018-06-08 14:39:01 WARN TaskSchedulerImpl:66 - Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources


Re: Read or save specific blocks of a file

2018-05-03 Thread Thodoris Zois
Hello Madhav,
What I did is pretty straightforward. Let's say that your HDFS block size is
128 MB and you store a 256 MB file named Test.csv in HDFS.

First run: `hdfs fsck Test.csv -locations -blocks -files`. It returns some very
useful information, including the list of blocks. Say you want to read the
first block (block 0). On the right side of the line that corresponds to block
0 you can find the IP of the machine that holds this specific block in its
local file system, as well as the block pool name (e.g. BP-1737920335-xxx.xxx.x.x-
1510660262864) and the block ID (e.g. blk_1073760915_20091) that will help you
recognize it later. So what you need from fsck is the block pool name, the
block ID, and the IP of the machine that holds the block you are interested in.

After you get these you have everything you need. Connect to that IP and
execute: `find /data/hdfs-data/datanode/current/<blockPoolName>/current/finalized/subdir0/
-name <blockID>`. That command returns the full path where you can find the
contents of Test.csv that correspond to that one HDFS block.

What I do after I get the full path is copy the file, remove the last line
(because there is a good chance the last line continues in the next block), and
store it again in HDFS under the desired name. Then I can access one block of
Test.csv from HDFS. That's all; if you need any further information do not
hesitate to contact me.
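
For anyone who prefers to do the first step programmatically instead of via fsck: the
block offsets, lengths and hosts can also be obtained through Hadoop's
FileSystem#getFileBlockLocations (note it does not give you the blk_ id or the local
path, only the byte ranges and datanodes). A minimal Scala sketch — the namenode URI
and file path are placeholders:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholders: adjust the namenode URI and file path for your cluster.
val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
val status = fs.getFileStatus(new Path("/user/me/Test.csv"))

// One BlockLocation per HDFS block: offset, length and the datanodes holding it.
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.zipWithIndex.foreach { case (b, i) =>
  println(s"block $i: offset=${b.getOffset} length=${b.getLength} " +
    s"hosts=${b.getHosts.mkString(",")}")
}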
- Thodoris

On Thu, 2018-05-03 at 14:47 +0530, Madhav A wrote:
> Thodoris,
> 
> 
> I certainly would be interested in knowing how you were able to
> identify individual blocks and read from them. My understanding was
> that the HDFS protocol abstracts this from consumers to prevent
> potential data corruption issues. I would appreciate it if you could
> share some details of your approach.
> 
> 
> Thanks!
> madhav
> On Wed, May 2, 2018 at 3:34 AM, Thodoris Zois wrote:
> > That’s what I did :) If you need further information I can post my
> > solution.. 
> > 
> > - Thodoris
> > On 30 Apr 2018, at 22:23, David Quiroga wrote:
> > 
> > > There might be a better way... but I wonder if it might be
> > > possible to access the node where the block is stored and read it
> > > from the local file system rather than from HDFS.
> > > On Mon, Apr 23, 2018 at 11:05 AM, Thodoris Zois wrote:
> > > > Hello list,
> > > > 
> > > > I have a file on HDFS that is divided into 10 blocks (partitions).
> > > > 
> > > > Is there any way to retrieve data from a specific block? (e.g. using
> > > > the blockID)
> > > > 
> > > > Except that, is there any option to write the contents of each block
> > > > (or of one block) into separate files?
> > > > 
> > > > Thank you very much,
> > > > Thodoris

Re: ML Linear and Logistic Regression - Poor Performance

2018-04-27 Thread Thodoris Zois
I am on CentOS 7 and I use Spark 2.3.0. Below I have posted my code. Logistic 
regression took 85 minutes and linear regression 127 seconds... 

My dataset, as I said, is 128 MB and contains 1000 features and ~100 classes. 


# Imports (omitted in the original message)
import time
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# SparkSession
ss = SparkSession.builder.getOrCreate()

start = time.time()

# Read data (the `file` path variable is defined elsewhere)
trainData = ss.read.format("csv").option("inferSchema", "true").load(file)

# Assemble every column except the label (_c0) into a single features vector
assembler = VectorAssembler(inputCols=trainData.columns[1:], outputCol="features")
trainData = assembler.transform(trainData)

# Drop the original feature columns, keeping only _c0 and features
dropColumns = [c for c in trainData.columns if c not in ('_c0', 'features')]
trainData = trainData.drop(*dropColumns)

# Rename column from _c0 to label
trainData = trainData.withColumnRenamed("_c0", "label")

# Logistic regression
lr = LogisticRegression(maxIter=500, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(trainData)

# Output coefficients
print("Coefficients: " + str(lrModel.coefficientMatrix))



- Thodoris


> On 27 Apr 2018, at 22:50, Irving Duran  wrote:
> 
> Are you reformatting the data correctly for logistic regression (meaning 0s and 
> 1s) before modeling? What OS and Spark version are you using?
> 
> Thank You,
> 
> Irving Duran
> 
> 
> On Fri, Apr 27, 2018 at 2:34 PM Thodoris Zois  wrote:
> Hello,
> 
> I am running an experiment to test logistic and linear regression on spark 
> using MLlib.
> 
> My dataset is only 128 MB and something weird happens. Linear regression takes 
> about 127 seconds, whether with 1 or 500 iterations. On the other hand, 
> logistic regression most of the time does not manage to finish, even with 1 
> iteration. I usually get a memory heap error.
> 
> In both cases I use the default cores and memory for the driver and I spawn 1 
> executor with 1 core and 2 GB of memory. 
> 
> Apart from that, I get a warning about NativeBLAS. I searched the Internet and 
> found that I have to install libgfortran, but even after installing it the 
> warning remains.
> 
> Any ideas for the above?
> 
> Thank you,
> - Thodoris
> 
> 



ML Linear and Logistic Regression - Poor Performance

2018-04-27 Thread Thodoris Zois
Hello,

I am running an experiment to test logistic and linear regression on spark 
using MLlib.

My dataset is only 128 MB and something weird happens. Linear regression takes 
about 127 seconds, whether with 1 or 500 iterations. On the other hand, logistic 
regression most of the time does not manage to finish, even with 1 iteration. 
I usually get a memory heap error.

In both cases I use the default cores and memory for the driver and I spawn 1 
executor with 1 core and 2 GB of memory. 

Apart from that, I get a warning about NativeBLAS. I searched the Internet and 
found that I have to install libgfortran, but even after installing it the 
warning remains.

Any ideas for the above?

Thank you,
- Thodoris




Re: Scala program to spark-submit on k8 cluster

2018-04-06 Thread Thodoris Zois
If you are looking for a Spark scheduler that runs on top of Kubernetes then 
this is the way to go: 
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala

You can also have a look here: 
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html
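
If what you actually want is just to call spark-submit from Scala code rather than 
to implement a scheduler backend, the launcher API may be closer to what you need. 
Here is a minimal sketch using org.apache.spark.launcher.SparkLauncher; the master 
URL, jar path, main class and container image are all placeholders:

import org.apache.spark.launcher.SparkLauncher

// Assumes SPARK_HOME points at a Spark 2.3 distribution (or call setSparkHome).
// All values below are placeholders; adjust them for your cluster and application.
val handle = new SparkLauncher()
  .setMaster("k8s://https://kubernetes-api-host:443")
  .setDeployMode("cluster")
  .setMainClass("com.example.MyApp")
  .setAppResource("local:///opt/spark/jars/my-app.jar")
  .setConf("spark.executor.instances", "2")
  .setConf("spark.kubernetes.container.image", "my-registry/spark:2.3.0")
  .startApplication()

// startApplication() returns a SparkAppHandle that can be polled for state.
println(handle.getState)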

- Thodoris

> On 6 Apr 2018, at 15:04, Kittu M  wrote:
> 
> Yes Yinan I’m looking for a Scala program which submits a Spark job to a k8s 
> cluster by running spark-submit programmatically 
> 
>> On Wednesday, April 4, 2018, Yinan Li  wrote:
>> Hi Kittu,
>> 
>> What do you mean by "a Scala program"? Do you mean a program that submits a 
>> Spark job to a k8s cluster by running spark-submit programmatically, or some 
>> example Scala application that is to run on the cluster? 
>> 
>>> On Wed, Apr 4, 2018 at 4:45 AM, Kittu M  wrote:
>>> Hi,
>>> 
>>> I’m looking for a Scala program to spark submit a Scala application (spark 
>>> 2.3 job) on k8 cluster .
>>> 
>>> Any help  would be much appreciated. Thanks 
>>> 
>>> 
>> 


1 Executor per partition

2018-04-04 Thread Thodoris Zois

Hello list!

I am trying to familiarize myself with Apache Spark. I would like to ask something 
about partitioning and executors. 

Can I have, e.g., 500 partitions but launch only one executor that runs operations 
on just 1 of the 500 partitions? After that I would like the job to finish. 

Is there an easy way to do this, or do I have to modify the code to achieve it?
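
To make the question concrete, here is the kind of thing I am after — a sketch using 
SparkContext.runJob, which accepts an explicit list of partition ids (the input path 
and partition count are placeholders, and I have not verified this is the intended 
approach):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("one-partition"))

// Placeholder input: an RDD split into 500 partitions.
val rdd = sc.textFile("hdfs:///path/to/input", minPartitions = 500)

// Run the job on partition 0 only; no tasks are scheduled for the other 499.
val counts: Array[Long] = sc.runJob(
  rdd,
  (it: Iterator[String]) => it.size.toLong,
  Seq(0)
)
println(s"lines in partition 0: ${counts.head}")

sc.stop()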

Thank you,
 Thodoris
