Using YARN w/o HDFS

2017-06-21 Thread Alaa Zubaidi (PDF)
Hi,

Can we run Spark on YARN without installing HDFS?
If yes, where would HADOOP_CONF_DIR point to?
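
For reference, a minimal sketch of the kind of setup this seems to imply. It is an assumption, not a verified configuration, that the default filesystem and the YARN staging directory are simply pointed somewhere other than HDFS; HADOOP_CONF_DIR would still point at the directory containing yarn-site.xml (and core-site.xml), even with no HDFS daemons running. The property values below are illustrative:

import org.apache.spark.sql.SparkSession

// Illustrative sketch: run on YARN while defaulting to a non-HDFS filesystem.
val spark = SparkSession.builder()
  .appName("yarn-without-hdfs")
  .master("yarn")
  // illustrative: use the local filesystem (or s3a://...) as the default FS
  .config("spark.hadoop.fs.defaultFS", "file:///")
  // illustrative: keep the YARN staging directory off HDFS as well
  .config("spark.yarn.stagingDir", "file:///tmp/spark-staging")
  .getOrCreate()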

Regards,



RE: PowerIterationClustering Benchmark

2016-12-18 Thread Mostafa Alaa Mohamed
Hi All,

I have the same issue with a single compressed .tgz file of around 3 GB. I increased the 
number of nodes without any effect on performance.
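
For reference, a brief sketch of why that can happen and what is sometimes done about it, assuming the data is (or has been repackaged as) gzip-compressed text; Spark's textFile does not unpack tar archives, and gzip is not splittable, so a single compressed file arrives as a single partition no matter how many nodes are available. The path and partition count below are illustrative:

// Illustrative: a single non-splittable (gzip) file is read into one partition;
// repartitioning after the read spreads the downstream work across the cluster.
val raw = sc.textFile("path/to/input.txt.gz")
println(s"partitions after read: ${raw.getNumPartitions}")   // typically 1 for one gzip file
val spread = raw.repartition(128)                             // illustrative partition count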


Best Regards,
Mostafa Alaa Mohamed,
Technical Expert Big Data,
M: +971506450787
Email: mohamedamost...@etisalat.ae<mailto:mohamedamost...@etisalat.ae>

From: Lydia Ickler [mailto:ickle...@googlemail.com]
Sent: Friday, December 16, 2016 02:04 AM
To: user@spark.apache.org
Subject: PowerIterationClustering Benchmark

Hi all,

I have a question regarding the PowerIterationClusteringExample.
I have adjusted the code so that it reads a file via 
sc.textFile("path/to/input"), which works fine.
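
For reference, a minimal sketch of that kind of setup. It assumes the input is a text file of "srcId,dstId,similarity" triples; the column layout, the explicit minPartitions value and the clustering parameters are illustrative assumptions, not details from this thread:

import org.apache.spark.mllib.clustering.PowerIterationClustering

// Illustrative: request an explicit number of partitions so a small input file
// is not collapsed into a handful of tasks.
val similarities = sc.textFile("path/to/input", 256 /* minPartitions, illustrative */)
  .map { line =>
    val Array(src, dst, sim) = line.split(",")
    (src.toLong, dst.toLong, sim.toDouble)
  }

val model = new PowerIterationClustering()
  .setK(2)               // illustrative number of clusters
  .setMaxIterations(20)
  .run(similarities)

With only ~16 MB of input, the file may end up in very few partitions, in which case adding nodes cannot help; requesting more partitions (or repartitioning) is one way to check whether the job is parallelism-bound at all.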

Now I wanted to benchmark the algorithm using different numbers of nodes to see 
how well the implementation scales. As a testbed I have up to 32 nodes 
available, each with 16 cores, running Spark 2.0.2 on YARN.
For my smallest input data set (16 MB) the runtime does not really change whether I 
use 1, 2, 4, 8, 16 or 32 nodes (always ~1.5 minutes).
Same behavior for my largest data set (2.3 GB): the runtime stays around 1 hour whether I 
use 16 or 32 nodes.

I was expecting that when I, for example, double the number of nodes, the runtime 
would shrink.
As for setting up my cluster environment, I tried different suggestions from 
this paper: https://hal.inria.fr/hal-01347638v1/document

Has anyone experienced the same? Or does anyone have suggestions about what might 
have gone wrong?

Thanks in advance!
Lydia





Unsubscribe

2016-12-14 Thread Mostafa Alaa Mohamed
Unsubscribe


Best Regards,
Mostafa Alaa Mohamed,
Technical Expert Big Data,
M: +971506450787
Email: mohamedamost...@etisalat.ae

-Original Message-
From: balaji9058 [mailto:kssb...@gmail.com]
Sent: Wednesday, December 14, 2016 08:32 AM
To: user@spark.apache.org
Subject: Re: Graphx triplet comparison

Hi, thanks for the reply.

Here is my code:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

class BusStopNode(val name: String, val mode: String, val maxpasengers: Int) extends Serializable

case class busstop(override val name: String, override val mode: String,
    val shelterId: String, override val maxpasengers: Int)
  extends BusStopNode(name, mode, maxpasengers) with Serializable

case class busNodeDetails(override val name: String, override val mode: String,
    val srcId: Int, val destId: Int, val arrivalTime: Int, override val maxpasengers: Int)
  extends BusStopNode(name, mode, maxpasengers) with Serializable

case class routeDetails(override val name: String, override val mode: String,
    val srcId: Int, val destId: Int, override val maxpasengers: Int)
  extends BusStopNode(name, mode, maxpasengers) with Serializable

// Bus-stop vertices from BusStopNameMini.txt
val busstopRDD: RDD[(VertexId, BusStopNode)] =
  sc.textFile("\\BusStopNameMini.txt").filter(!_.startsWith("#")).map { line =>
    val row = line split ","
    (row(0).toLong, new busstop(row(0), row(3), row(1) + row(0), row(2).toInt))
  }
busstopRDD.foreach(println)

// Route vertices from RouteDetails.txt
val busNodeDetailsRdd: RDD[(VertexId, BusStopNode)] =
  sc.textFile("\\RouteDetails.txt").filter(!_.startsWith("#")).map { line =>
    val row = line split ","
    (row(0).toLong, new busNodeDetails(row(0), row(4), row(1).toInt, row(2).toInt, row(3).toInt, 0))
  }
busNodeDetailsRdd.foreach(println)

// Edges from routesEdgeNew.txt
val detailedStats: RDD[Edge[BusStopNode]] =
  sc.textFile("\\routesEdgeNew.txt").filter(!_.startsWith("#")).map { line =>
    val row = line split ','
    Edge(row(0).toLong, row(1).toLong, new BusStopNode(row(2), row(3), 1))
  }

val busGraph = busstopRDD ++ busNodeDetailsRdd
busGraph.foreach(println)
val mainGraph = Graph(busGraph, detailedStats)
mainGraph.triplets.foreach(println)

val subGraph = mainGraph.subgraph(epred = _.srcAttr.name == "101")

// Working fine
for (subTriplet <- subGraph.triplets) {
  println(subTriplet.dstAttr.name)
}

// Working fine
for (mainTriplet <- mainGraph.triplets) {
  println(mainTriplet.dstAttr.name)
}

// Causing error while iterating both at the same time
for (subTriplet <- subGraph.triplets) {
  for (mainTriplet <- mainGraph.triplets) {   // NullPointerException is thrown here
    if (subTriplet.dstAttr.name.toString.equals(mainTriplet.dstAttr.name)) {
      println("hello")   // success case on matching destination names of subgraph and main graph
    }
  }
}
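
For reference, a minimal sketch of one way to avoid the failing nested iteration, assuming the subgraph's triplets are small enough to collect to the driver; an RDD cannot be referenced inside another RDD's closure, which is what triggers the NullPointerException:

// Illustrative: materialize the subgraph's destination names on the driver,
// then do a single pass over the main graph's triplets.
val subDstNames = subGraph.triplets.map(_.dstAttr.name).collect().toSet
mainGraph.triplets
  .filter(t => subDstNames.contains(t.dstAttr.name))
  .foreach(_ => println("hello"))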

BusStopNameMini.txt
101,bs,10,B
102,bs,10,B
103,bs,20,B
104,bs,14,B
105,bs,8,B


RouteDetails.txt

#101,102,104  4 5 6
#102,103 3 4
#103,105,104 2 3 4
#104,102,101  4 5 6
#104,1015
#105,104,102 5 6 2
1,101,104,5,R
2,102,103,5,R
3,103,104,5,R
4,102,103,5,R
5,104,101,5,R
6,105,102,5,R

routesEdgeNew.txt contains two types of edges: bus-to-bus edges whose value 
is a distance, and bus-to-route edges whose value is a time.
#101,102,104  4 5 6
#102,103 3 4
#103,105,104 2 3 4
#104,102,101  4 5 6
#104,1015
#105,104,102 5 6 2
101,102,4,BS
102,104,5,BS
102,103,3,BS
103,105,4,BS
105,104,3,BS
104,102,4,BS
102,101,5,BS
104,101,5,BS
105,104,5,BS
104,102,6,BS
101,1,4,R,102
101,1,4,R,103
102,2,5,R
103,3,6,R
103,3,5,R
104,4,7,R
105,5,4,Z
101,2,9,R
105,5,4,R
105,2,5,R
104,2,5,R
103,1,4,R
101,103,4,BS
101,104,4,BS
101,105,4,BS
101,103,5,BS
101,104,5,BS
101,105,5,BS
1,101,4,R













Spark Hive Rejection

2016-09-29 Thread Mostafa Alaa Mohamed
Dears,
I want to ask:

* What will happen if there are rejected rows when inserting a dataframe 
into Hive?

o   For example, the table requires an integer in a column but the 
dataframe contains a string.

o   Or rows rejected by a duplication restriction on the table itself?

* How can we specify the rejection directory?
If this is not available, do you recommend opening a JIRA issue?
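
As far as I know there is no built-in rejection directory when writing a dataframe into Hive; a hedged sketch of the usual workaround is to split the dataframe into valid and invalid rows before the insert. The dataframe (df), column names, paths and table name below are illustrative assumptions:

import org.apache.spark.sql.functions.col

// Illustrative: rows whose "amount" column cannot be cast to an integer are
// written to a side location (a "rejection directory") instead of failing the insert.
val withCast = df.withColumn("amount_int", col("amount").cast("int"))

val rejected = withCast.filter(col("amount").isNotNull && col("amount_int").isNull)
val accepted = withCast.filter(col("amount").isNull || col("amount_int").isNotNull)

rejected.drop("amount_int").write.mode("overwrite").parquet("/tmp/rejected_rows")
accepted.drop("amount").withColumnRenamed("amount_int", "amount")
  .write.insertInto("target_hive_table")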

Best Regards,
Mostafa Alaa Mohamed,
Technical Expert Big Data,
M: +971506450787
Email: mohamedamost...@etisalat.ae<mailto:mohamedamost...@etisalat.ae>




DataFrame Rejection Directory

2016-09-27 Thread Mostafa Alaa Mohamed
Hi All,
I have a dataframe that contains some data and I need to insert it into a Hive table. My 
questions:

1- Where will Spark save the rejected rows from the insert statements?
2- Can Spark fail if some rows are rejected?
3- How can I specify the rejection directory?

Regards,





Spark partitions from CassandraRDD

2015-09-03 Thread Alaa Zubaidi (PDF)
Hi,

I am testing Spark and Cassandra: Spark 1.4, Cassandra 2.1.7, Cassandra Spark
connector 1.4, running in standalone mode.

I am getting 4000 rows from Cassandra (4 MB rows), where the row keys are
random:
.. sc.cassandraTable[RES](keyspace, res_name).where(res_where).cache

I was expecting that it would generate a few partitions.
However, I can see ONLY 1 partition.
I cached the CassandraRDD and in the UI Storage tab it shows ONLY 1
partition.

Any idea why I am getting only 1 partition?
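
A hedged sketch of a quick check, assuming the goal is simply more parallelism downstream of the read. The partition counts below are illustrative, and the split-size setting name is taken from the Cassandra connector documentation but not verified against connector 1.4 specifically:

// Illustrative: inspect how many partitions the connector produced,
// and repartition if more downstream parallelism is wanted.
val rdd = sc.cassandraTable[RES](keyspace, res_name).where(res_where).cache
println(s"partitions from the connector: ${rdd.partitions.length}")
val spread = rdd.repartition(8)   // illustrative partition count

// The connector's split size can also be lowered when the SparkConf is built,
// so small result sets are spread over more partitions:
// new SparkConf().set("spark.cassandra.input.split.size_in_mb", "1")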

Thanks,
Alaa



Re: Spark partitions from CassandraRDD

2015-09-03 Thread Alaa Zubaidi (PDF)
Thanks Ankur,

But I grabbed some keys from the Spark results and ran "nodetool -h  
getendpoints ", and it showed that the data is coming from at least 2 nodes?
Regards,
Alaa

On Thu, Sep 3, 2015 at 12:06 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi Alaa,
>
> Partition when using CassandraRDD depends on your partition key in
> Cassandra table.
>
> If you see only 1 partition in the RDD it means all the rows you have
> selected have same partition_key in C*
>
> Thanks
> Ankur
>
>
> On Thu, Sep 3, 2015 at 11:54 AM, Alaa Zubaidi (PDF) <alaa.zuba...@pdf.com>
> wrote:
>
>> Hi,
>>
>> I testing Spark and Cassandra, Spark 1.4, Cassandra 2.1.7 cassandra spark
>> connector 1.4, running in standalone mode.
>>
>> I am getting 4000 rows from Cassandra (4mb row), where the row keys are
>> random.
>> .. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache
>>
>> I am expecting that it will generate few partitions.
>> However, I can ONLY see 1 partition.
>> I cached the CassandraRDD and in the UI storage tab it shows ONLY 1
>> partition.
>>
>> Any idea, why I am getting 1 partition?
>>
>> Thanks,
>> Alaa
>>
>
>
>


-- 

Alaa Zubaidi
PDF Solutions, Inc.
333 West San Carlos Street, Suite 1000
San Jose, CA 95110  USA
Tel: 408-283-5639
fax: 408-938-6479
email: alaa.zuba...@pdf.com



Creating a front-end for output from Spark/PySpark

2014-11-23 Thread Alaa Ali
Hello. Okay, so I'm working on a project to run analytic processing using
Spark or PySpark. Right now, I connect to the shell and execute my
commands. The very first part of my commands is to create an SQL JDBC
connection and cursor to pull from Apache Phoenix, do some processing on
the returned data, and spit out some output. I want to create a web GUI
tool where I can play around with which SQL query is executed for my analysis.

I know that I can write my whole Spark program, use spark-submit, and
have it accept an argument that is the SQL query I want to execute, but this
means that every time I submit, an SQL connection is created, the query is
run, the processing is done, the output is printed, the program closes and the SQL
connection closes, and then the whole thing repeats if I want to do another query
right away. That will probably be very slow. Is there a way
where I can somehow have the SQL connection working in the backend, for
example, and then all I have to do is supply a query from my GUI tool, which
then takes it, runs it, and displays the output? I just want the big
picture and a broad overview of how I would go about doing this and what
additional technology to use, and I'll dig up the rest.
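
One broad-strokes sketch of the usual shape of this, assuming a single long-running driver keeps the SparkContext (and its JDBC setup) alive and is handed queries by whatever web layer sits in front of it. Everything below, including the QueryService name and the bound values, is illustrative; projects such as spark-jobserver package the same idea as a REST service:

import java.sql.{DriverManager, ResultSet}

import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Illustrative long-running service: the SparkContext is created once and
// reused for every query submitted from the GUI.
object QueryService {
  def run(sc: SparkContext, sql: String): Array[String] = {
    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:phoenix:zookeeper"), // URL as used elsewhere in this thread
      sql,                // must contain two '?' placeholders for the bounds
      1, 1000, 4,         // illustrative lower bound, upper bound, partitions
      (r: ResultSet) => r.getString(1))
    rdd.collect()
  }
}

// A web framework (or even a loop reading from a socket or queue) would call
// QueryService.run(sc, userSuppliedSql) per request, so the context and
// executors never have to be re-created between queries.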

Regards,
Alaa Ali


Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-22 Thread Alaa Ali
Thanks Alex! I'm actually working with views from HBase because I will
never edit the HBase table from Phoenix and I'd hate to accidentally drop
it. I'll have to work out how to create the view with the additional ID
column.

Regards,
Alaa Ali

On Fri, Nov 21, 2014 at 5:26 PM, Alex Kamil alex.ka...@gmail.com wrote:

 Ali,

 just create a BIGINT column with numeric values in Phoenix and use
 sequences (http://phoenix.apache.org/sequences.html) to populate it
 automatically

 I included the setup below in case someone starts from scratch

 Prerequisites:
 - export JAVA_HOME, SCALA_HOME and install sbt
 - install HBase in standalone mode
 (http://hbase.apache.org/book/quickstart.html)
 - add the Phoenix jar (http://phoenix.apache.org/download.html) to the HBase lib
 directory
 - start HBase and create a table in Phoenix
 (http://phoenix.apache.org/Phoenix-in-15-minutes-or-less.html) to verify
 everything is working
 - install Spark in standalone mode, and verify that it works using the Spark
 shell (http://spark.apache.org/docs/latest/quick-start.html)

 1. create a sequence (http://phoenix.apache.org/sequences.html) in
 Phoenix:
 $PHOENIX_HOME/hadoop1/bin/sqlline.py localhost

   CREATE SEQUENCE IF NOT EXISTS my_schema.my_sequence;

 2.add a BIGINT column called e.g. id to your table in phoenix

  CREATE TABLE test.orders ( id BIGINT not null primary key, name VARCHAR);

 3. add some values
 UPSERT INTO test.orders (id, name) VALUES( NEXT VALUE FOR
 my_schema.my_sequence, 'foo');
 UPSERT INTO test.orders (id, name) VALUES( NEXT VALUE FOR
 my_schema.my_sequence, 'bar');

 4. create the JDBC adapter (following the SimpleApp setup in the Spark quick
 start's standalone applications section:
 https://spark.apache.org/docs/latest/quick-start.html#Standalone_Applications
 ):

 //SparkToJDBC.scala

 import java.sql.{Connection, DriverManager, ResultSet}

 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.JdbcRDD

 object SparkToJDBC {

   def main(args: Array[String]) {
     val sc = new SparkContext("local", "phoenix")
     try {
       val rdd = new JdbcRDD(sc,
         () => {
           Class.forName("org.apache.phoenix.jdbc.PhoenixDriver").newInstance()
           DriverManager.getConnection("jdbc:phoenix:localhost", "", "")
         },
         "SELECT id, name FROM test.orders WHERE id >= ? AND id <= ?",
         1, 100, 3,
         (r: ResultSet) => processResultSet(r)
       ).cache()

       println("#")
       println(rdd.count())
       println("#")
     } catch {
       case _: Throwable => println("Could not connect to database")
     }
     sc.stop()
   }

   def processResultSet(rs: ResultSet) {
     val rsmd = rs.getMetaData()
     val numberOfColumns = rsmd.getColumnCount()

     // print column metadata
     var i = 1
     while (i <= numberOfColumns) {
       val colName = rsmd.getColumnName(i)
       val tableName = rsmd.getTableName(i)
       val name = rsmd.getColumnTypeName(i)
       val caseSen = rsmd.isCaseSensitive(i)
       val writable = rsmd.isWritable(i)
       println("Information for column " + colName)
       println("Column is in table " + tableName)
       println("column type is " + name)
       println()
       i += 1
     }

     // print the rows themselves
     while (rs.next()) {
       var i = 1
       while (i <= numberOfColumns) {
         val s = rs.getString(i)
         System.out.print(s + " ")
         i += 1
       }
       println()
     }
   }

 }

 5. build SparkToJDBC.scala
 sbt package

 6. execute spark job:
 note: don't forget to add phoenix jar using --jars option like this:

 ../spark-1.1.0/bin/spark-submit --jars \
   ../phoenix-3.1.0-bin/hadoop2/phoenix-3.1.0-client-hadoop2.jar \
   --class SparkToJDBC --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar

 regards
 Alex


 On Fri, Nov 21, 2014 at 4:34 PM, Josh Mahonin jmaho...@interset.com
 wrote:

 Hi Alaa Ali,

 In order for Spark to split the JDBC query in parallel, it expects an
 upper and lower bound for your input data, as well as a number of
 partitions so that it can split the query across multiple tasks.

 For example, depending on your data distribution, you could set an upper
 and lower bound on your timestamp range, and spark should be able to create
 new sub-queries to split up the data.

 Another option is to load up the whole table using the PhoenixInputFormat
 as a NewHadoopRDD. It doesn't yet support many of Phoenix's aggregate
 functions, but it does let you load up whole tables as RDDs.

 I've previously posted example code here:

 http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox

Spark SQL with Apache Phoenix lower and upper Bound

2014-11-21 Thread Alaa Ali
I want to run queries on Apache Phoenix which has a JDBC driver. The query
that I want to run is:

select ts,ename from random_data_date limit 10

But I'm having issues with the JdbcRDD upper and lowerBound parameters
(that I don't actually understand).

Here's what I have so far:

import org.apache.spark.rdd.JdbcRDD
import java.sql.{Connection, DriverManager, ResultSet}

val url = "jdbc:phoenix:zookeeper"
val sql = "select ts,ename from random_data_date limit ?"
val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url), sql, 5,
10, 2, r => r.getString("ts") + ", " + r.getString("ename"))

But this doesn't work because the sql expression that the JdbcRDD expects
has to have two ?s to represent the lower and upper bound.

How can I run my query through the JdbcRDD?
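
A hedged sketch of the shape the call usually needs, assuming ts can be constrained by a numeric range that JdbcRDD can split; the bounds, the partition count and the epoch-millisecond framing are illustrative assumptions:

// Illustrative: the query carries two '?' placeholders, which JdbcRDD fills
// with per-partition sub-ranges of [lowerBound, upperBound].
val boundedSql = "select ts, ename from random_data_date where ts >= ? and ts <= ?"
val boundedRDD = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url),
  boundedSql,
  1416182400000L,   // illustrative lower bound on ts (epoch millis)
  1416787200000L,   // illustrative upper bound on ts
  4,                // illustrative number of partitions
  r => r.getString("ts") + ", " + r.getString("ename"))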

Regards,
Alaa Ali


Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-21 Thread Alaa Ali
Awesome, thanks Josh, I missed that previous post of yours! But your code
snippet shows a select statement, so what I can do is just run a simple
select with a where clause if I want to, and then run my data processing on
the RDD to mimic the aggregation I want to do with SQL, right? Also,
another question, I still haven't tried this out, but I'll actually be
using this with PySpark, so I'm guessing the PhoenixPigConfiguration and
newHadoopRDD can be defined in PySpark as well?

Regards,
Alaa Ali

On Fri, Nov 21, 2014 at 4:34 PM, Josh Mahonin jmaho...@interset.com wrote:

 Hi Alaa Ali,

 In order for Spark to split the JDBC query in parallel, it expects an
 upper and lower bound for your input data, as well as a number of
 partitions so that it can split the query across multiple tasks.

 For example, depending on your data distribution, you could set an upper
 and lower bound on your timestamp range, and spark should be able to create
 new sub-queries to split up the data.

 Another option is to load up the whole table using the PhoenixInputFormat
 as a NewHadoopRDD. It doesn't yet support many of Phoenix's aggregate
 functions, but it does let you load up whole tables as RDDs.

 I've previously posted example code here:

 http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAJ6CGtA1DoTdadRtT5M0+75rXTyQgu5gexT+uLccw_8Ppzyt=q...@mail.gmail.com%3E

 There's also an example library implementation here, although I haven't
 had a chance to test it yet:
 https://github.com/simplymeasured/phoenix-spark

 Josh

 On Fri, Nov 21, 2014 at 4:14 PM, Alaa Ali contact.a...@gmail.com wrote:

 I want to run queries on Apache Phoenix which has a JDBC driver. The
 query that I want to run is:

 select ts,ename from random_data_date limit 10

 But I'm having issues with the JdbcRDD upper and lowerBound parameters
 (that I don't actually understand).

 Here's what I have so far:

 import org.apache.spark.rdd.JdbcRDD
 import java.sql.{Connection, DriverManager, ResultSet}

 val url = "jdbc:phoenix:zookeeper"
 val sql = "select ts,ename from random_data_date limit ?"
 val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url), sql,
 5, 10, 2, r => r.getString("ts") + ", " + r.getString("ename"))

 But this doesn't work because the sql expression that the JdbcRDD expects
 has to have two ?s to represent the lower and upper bound.

 How can I run my query through the JdbcRDD?

 Regards,
 Alaa Ali





Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread alaa
Hey freedafeng, I'm exactly where you are. I want the output to show the
rowkey and all column qualifiers that correspond to it. How did you write
HBaseResultToStringConverter to do what you wanted it to do?
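
For what it's worth, a hedged sketch of the kind of converter being discussed, modeled on the Converter trait that the PySpark HBase examples use; the class body and the output format are assumptions about what is wanted here, not code from this thread:

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

// Illustrative: turn each HBase Result into "rowkey: family:qualifier=value, ..."
// so PySpark sees the row key together with every column qualifier.
class HBaseResultToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]
    val rowKey = Bytes.toStringBinary(result.getRow)
    val cells = result.rawCells().map { cell =>
      val family    = Bytes.toStringBinary(CellUtil.cloneFamily(cell))
      val qualifier = Bytes.toStringBinary(CellUtil.cloneQualifier(cell))
      val value     = Bytes.toStringBinary(CellUtil.cloneValue(cell))
      s"$family:$qualifier=$value"
    }
    s"$rowKey: ${cells.mkString(", ")}"
  }
}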


