Using YARN w/o HDFS
Hi, Can we run Spark on YARN without installing HDFS? If yes, where would HADOOP_CONF_DIR point to? Regards,
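For what it's worth, a sketch of what a minimal HADOOP_CONF_DIR can look like without HDFS. The hostnames and paths below are placeholders, and note that YARN itself still needs some filesystem (local on a single node, or a shared store such as S3 on a real cluster) for staging the Spark jars:

    export HADOOP_CONF_DIR=/etc/hadoop/conf    # client-side configs only; no HDFS daemons required

    # /etc/hadoop/conf/yarn-site.xml -- tells Spark where the ResourceManager is:
    #   yarn.resourcemanager.hostname = rm-host.example.com
    # /etc/hadoop/conf/core-site.xml -- fs.defaultFS does not have to be HDFS:
    #   fs.defaultFS = file:///        (single node; use a shared store on a multi-node cluster)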
RE: PowerIterationClustering Benchmark
Hi All, I have the same issue with one compressed .tgz file of around 3 GB. I increased the number of nodes without any effect on performance. Best Regards, Mostafa Alaa Mohamed, Technical Expert Big Data, M: +971506450787 Email: mohamedamost...@etisalat.ae

From: Lydia Ickler [mailto:ickle...@googlemail.com]
Sent: Friday, December 16, 2016 02:04 AM
To: user@spark.apache.org
Subject: PowerIterationClustering Benchmark

Hi all, I have a question regarding the PowerIterationClusteringExample. I have adjusted the code so that it reads a file via sc.textFile("path/to/input"), which works fine. Now I want to benchmark the algorithm with different numbers of nodes to see how well the implementation scales. As a testbed I have up to 32 nodes available, each with 16 cores, running Spark 2.0.2 on YARN. For my smallest input data set (16 MB) the runtime does not really change whether I use 1, 2, 4, 8, 16 or 32 nodes (always ~1.5 minutes). Same behavior for my largest data set (2.3 GB): the runtime stays around 1 hour whether I use 16 or 32 nodes. I was expecting that, e.g., doubling the number of nodes would shrink the runtime. For setting up my cluster environment I tried different suggestions from this paper: https://hal.inria.fr/hal-01347638v1/document Has someone experienced the same? Or does someone have suggestions about what might have gone wrong? Thanks in advance! Lydia
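One likely factor in both reports, offered as a sketch: gzip is not a splittable codec, so a single .gz/.tgz archive read with sc.textFile lands in exactly one partition, and adding nodes cannot help until the data is explicitly spread out (note too that textFile only gunzips; it does not unpack the tar headers inside a .tgz). A minimal check, with the input path as a placeholder:

    val raw = sc.textFile("path/to/input")    // one .gz/.tgz file => one partition
    println(raw.getNumPartitions)             // prints 1 for a single gzip'd file
    // pay the shuffle once up front so the iterative stages can run in parallel
    val lines = raw.repartition(sc.defaultParallelism).cache()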
Unsubscribe
Unsubscribe. Best Regards, Mostafa Alaa Mohamed, Technical Expert Big Data, M: +971506450787 Email: mohamedamost...@etisalat.ae

-----Original Message-----
From: balaji9058 [mailto:kssb...@gmail.com]
Sent: Wednesday, December 14, 2016 08:32 AM
To: user@spark.apache.org
Subject: Re: Graphx triplet comparison

Hi, thanks for the reply. Here is my code:

    class BusStopNode(val name: String, val mode: String, val maxpasengers: Int) extends Serializable

    case class busstop(override val name: String, override val mode: String,
        val shelterId: String, override val maxpasengers: Int)
      extends BusStopNode(name, mode, maxpasengers) with Serializable

    case class busNodeDetails(override val name: String, override val mode: String,
        val srcId: Int, val destId: Int, val arrivalTime: Int, override val maxpasengers: Int)
      extends BusStopNode(name, mode, maxpasengers) with Serializable

    case class routeDetails(override val name: String, override val mode: String,
        val srcId: Int, val destId: Int, override val maxpasengers: Int)
      extends BusStopNode(name, mode, maxpasengers) with Serializable

    val busstopRDD: RDD[(VertexId, BusStopNode)] =
      sc.textFile("\\BusStopNameMini.txt").filter(!_.startsWith("#")).map { line =>
        val row = line split ","
        (row(0).toInt, new busstop(row(0), row(3), row(1) + row(0), row(2).toInt))
      }
    busstopRDD.foreach(println)

    val busNodeDetailsRdd: RDD[(VertexId, BusStopNode)] =
      sc.textFile("\\RouteDetails.txt").filter(!_.startsWith("#")).map { line =>
        val row = line split ","
        (row(0).toInt, new busNodeDetails(row(0), row(4), row(1).toInt, row(2).toInt, row(3).toInt, 0))
      }
    busNodeDetailsRdd.foreach(println)

    val detailedStats: RDD[Edge[BusStopNode]] =
      sc.textFile("\\routesEdgeNew.txt").filter(!_.startsWith("#")).map { line =>
        val row = line split ','
        Edge(row(0).toInt, row(1).toInt, new BusStopNode(row(2), row(3), 1))
      }

    val busGraph = busstopRDD ++ busNodeDetailsRdd
    busGraph.foreach(println)
    val mainGraph = Graph(busGraph, detailedStats)
    mainGraph.triplets.foreach(println)
    val subGraph = mainGraph subgraph (epred = _.srcAttr.name == "101")

    // Working fine
    for (subTriplet <- subGraph.triplets) {
      println(subTriplet.dstAttr.name)
    }

    // Working fine
    for (mainTriplet <- mainGraph.triplets) {
      println(mainTriplet.dstAttr.name)
    }

    // Causing error while iterating both at the same time
    for (subTriplet <- subGraph.triplets) {
      for (mainTriplet <- mainGraph.triplets) {   // NullPointerException is raised here
        if (subTriplet.dstAttr.name.toString.equals(mainTriplet.dstAttr.name)) {
          println("hello")  // success case: destination names of subgraph and maingraph match
        }
      }
    }

BusStopNameMini.txt:

    101,bs,10,B
    102,bs,10,B
    103,bs,20,B
    104,bs,14,B
    105,bs,8,B

RouteDetails.txt:

    #101,102,104 4 5 6
    #102,103 3 4
    #103,105,104 2 3 4
    #104,102,101 4 5 6
    #104,1015
    #105,104,102 5 6 2
    1,101,104,5,R
    2,102,103,5,R
    3,103,104,5,R
    4,102,103,5,R
    5,104,101,5,R
    6,105,102,5,R

routesEdgeNew.txt (it contains two types of edges: bus to bus, where the edge value is distance, and bus to route, where the edge value is time):

    #101,102,104 4 5 6
    #102,103 3 4
    #103,105,104 2 3 4
    #104,102,101 4 5 6
    #104,1015
    #105,104,102 5 6 2
    101,102,4,BS
    102,104,5,BS
    102,103,3,BS
    103,105,4,BS
    105,104,3,BS
    104,102,4,BS
    102,101,5,BS
    104,101,5,BS
    105,104,5,BS
    104,102,6,BS
    101,1,4,R,102
    101,1,4,R,103
    102,2,5,R
    103,3,6,R
    103,3,5,R
    104,4,7,R
    105,5,4,Z
    101,2,9,R
    105,5,4,R
    105,2,5,R
    104,2,5,R
    103,1,4,R
    101,103,4,BS
    101,104,4,BS
    101,105,4,BS
    101,103,5,BS
    101,104,5,BS
    101,105,5,BS
    1,101,4,R
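For later readers: the NullPointerException in the nested loop above is the usual symptom of referencing one RDD (mainGraph.triplets) inside a closure that runs against another RDD on the executors; RDD operations can only be driven from the driver. A minimal sketch of one workaround, assuming the subgraph side is small enough to collect and broadcast:

    // collect the (small) subgraph destination names to the driver, broadcast them,
    // then do the comparison in a single pass over the main graph's triplets
    val subNames = subGraph.triplets.map(_.dstAttr.name).collect().toSet
    val subNamesBc = sc.broadcast(subNames)
    mainGraph.triplets
      .filter(t => subNamesBc.value.contains(t.dstAttr.name))
      .foreach(t => println("hello"))   // success case for each matching destination name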
Spark Hive Rejection
Dears, I want to ask:
* What will happen if there are rejected rows when inserting a dataframe into Hive?
  o For example, the table requires an integer in a column but the dataframe contains a string.
  o Or rows rejected by a duplication restriction on the table itself?
* How can we specify the rejection directory?
If this is not available, do you recommend opening a Jira issue? Best Regards, Mostafa Alaa Mohamed, Technical Expert Big Data, M: +971506450787 Email: mohamedamost...@etisalat.ae
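As far as I know Spark has no built-in reject-row handling for Hive inserts (a type mismatch either fails the query at analysis time or becomes null under an implicit cast, depending on how the insert is written), so the usual workaround is to split the frame yourself. A sketch, with the column name "qty", the reject path, and the table name all hypothetical:

    import org.apache.spark.sql.functions.col

    // assumes the poster's df feeds an integer Hive column from a string column "qty";
    // cast it and separate clean rows from rejects
    val parsed   = df.withColumn("qty_i", col("qty").cast("int"))
    val rejected = parsed.filter(col("qty").isNotNull && col("qty_i").isNull)
    val accepted = parsed.filter(col("qty").isNull || col("qty_i").isNotNull)

    rejected.drop("qty_i").write.mode("append").json("/tmp/rejects/orders")  // a self-chosen "rejection directory"
    accepted.drop("qty").withColumnRenamed("qty_i", "qty")
            .write.mode("append").insertInto("db.orders")                    // hypothetical Hive table; columns match by position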
DataFrame Rejection Directory
Hi All, I have a dataframe that contains some data and I need to insert it into a Hive table. My questions:
1- Where will Spark save the rejected rows from the insertion statements?
2- Can Spark fail if some rows are rejected?
3- How can I specify the rejection directory?
Regards,
Spark partitions from CassandraRDD
Hi, I am testing Spark and Cassandra: Spark 1.4, Cassandra 2.1.7, cassandra spark connector 1.4, running in standalone mode. I am getting 4000 rows from Cassandra (4 MB row), where the row keys are random.

    .. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache

I am expecting that it will generate a few partitions. However, I can ONLY see 1 partition. I cached the CassandraRDD, and in the UI storage tab it shows ONLY 1 partition. Any idea why I am getting 1 partition? Thanks, Alaa
Re: Spark partitions from CassandraRDD
Thanks Ankur, but I grabbed some keys from the Spark results and ran "nodetool -h getendpoints " and it showed the data is coming from at least 2 nodes? Regards, Alaa

On Thu, Sep 3, 2015 at 12:06 PM, Ankur Srivastava <ankur.srivast...@gmail.com> wrote:
> Hi Alaa,
> Partitioning when using CassandraRDD depends on the partition key in your Cassandra table.
> If you see only 1 partition in the RDD it means all the rows you have selected have the same partition_key in C*.
> Thanks, Ankur
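One more detail worth checking, offered as a sketch and from memory of the 1.x connector's internals: the number of Spark partitions is derived from the estimated data size per split (the spark.cassandra.input.split.size family of settings), not from how many nodes own the rows, so a ~4 MB result can collapse into a single partition even when it spans several replicas:

    val rdd = sc.cassandraTable[RES](keyspace, res_name).where(res_where)
    println(rdd.partitions.length)       // 1 here, matching what the storage tab shows
    // spread the rows explicitly if the downstream work needs more parallelism
    val spread = rdd.repartition(8).cache()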
Creating a front-end for output from Spark/PySpark
Hello. Okay, so I'm working on a project to run analytic processing using Spark or PySpark. Right now, I connect to the shell and execute my commands. The very first part of my commands is to create an SQL JDBC connection and cursor to pull from Apache Phoenix, do some processing on the returned data, and spit out some output. I want to create a web GUI tool kind of a thing where I can play around with which SQL query is executed for my analysis.

I know that I can write my whole Spark program, use spark-submit, and have it accept an argument for the SQL query I want to execute, but this means that on every submit an SQL connection is created, the query runs, processing is done, output is printed, the program closes along with the SQL connection, and then the whole thing repeats if I want to do another query right away. That will probably be very slow.

Is there a way where I can somehow have the SQL connection working in the backend, for example, and then all I have to do is supply a query from my GUI tool, which takes it, runs it, and displays the output? I just want to know the big picture and a broad overview of how I would go about doing this and what additional technology to use, and I'll dig up the rest. Regards, Alaa Ali
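The big-picture pattern is a long-running driver that holds the SparkContext (and the Phoenix connection) open and accepts queries from the GUI; spark-jobserver wraps roughly this idea behind a REST API. A hand-rolled sketch, with the port and wire protocol purely illustrative:

    import java.io.PrintWriter
    import java.net.ServerSocket
    import scala.io.Source
    import org.apache.spark.{SparkConf, SparkContext}

    object QueryServer {
      def main(args: Array[String]): Unit = {
        // created once; reused across every query the GUI sends
        val sc = new SparkContext(new SparkConf().setAppName("query-server"))
        val server = new ServerSocket(9999)   // hypothetical port the GUI connects to
        while (true) {
          val sock = server.accept()
          // the client sends the SQL text and half-closes its side of the socket
          val query = Source.fromInputStream(sock.getInputStream).mkString.trim
          val out = new PrintWriter(sock.getOutputStream, true)
          // run `query` through the existing JDBC/RDD pipeline here
          // and stream the results back instead of this placeholder
          out.println("received: " + query)
          sock.close()
        }
      }
    }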
Re: Spark SQL with Apache Phoenix lower and upper Bound
Thanks Alex! I'm actually working with views from HBase because I will never edit the HBase table from Phoenix and I'd hate to accidentally drop it. I'll have to work out how to create the view with the additional ID column. Regards, Alaa Ali

On Fri, Nov 21, 2014 at 5:26 PM, Alex Kamil <alex.ka...@gmail.com> wrote:

Ali, just create a BIGINT column with numeric values in Phoenix and use sequences (http://phoenix.apache.org/sequences.html) to populate it automatically. I included the setup below in case someone starts from scratch.

Prerequisites:
- export JAVA_HOME, SCALA_HOME and install sbt
- install HBase in standalone mode: http://hbase.apache.org/book/quickstart.html
- add the Phoenix jar (http://phoenix.apache.org/download.html) to the HBase lib directory
- start HBase and create a table in Phoenix (http://phoenix.apache.org/Phoenix-in-15-minutes-or-less.html) to verify everything is working
- install Spark in standalone mode and verify that it works using the Spark shell: http://spark.apache.org/docs/latest/quick-start.html

1. create a sequence (http://phoenix.apache.org/sequences.html) in Phoenix:

    $PHOENIX_HOME/hadoop1/bin/sqlline.py localhost
    CREATE SEQUENCE IF NOT EXISTS my_schema.my_sequence;

2. add a BIGINT column called e.g. id to your table in Phoenix:

    CREATE TABLE test.orders (id BIGINT not null primary key, name VARCHAR);

3. add some values:

    UPSERT INTO test.orders (id, name) VALUES (NEXT VALUE FOR my_schema.my_sequence, 'foo');
    UPSERT INTO test.orders (id, name) VALUES (NEXT VALUE FOR my_schema.my_sequence, 'bar');

4. create a JDBC adapter (following the SimpleApp setup in the Spark quick start guide on standalone applications: https://spark.apache.org/docs/latest/quick-start.html#Standalone_Applications):

    // SparkToJDBC.scala
    import java.sql.{Connection, DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    object SparkToJDBC {

      def main(args: Array[String]) {
        val sc = new SparkContext("local", "phoenix")
        try {
          val rdd = new JdbcRDD(sc,
            () => {
              Class.forName("org.apache.phoenix.jdbc.PhoenixDriver").newInstance()
              DriverManager.getConnection("jdbc:phoenix:localhost", "", "")
            },
            "SELECT id, name FROM test.orders WHERE id >= ? AND id <= ?",
            1, 100, 3,
            (r: ResultSet) => processResultSet(r)
          ).cache()
          println("#")
          println(rdd.count())
          println("#")
        } catch {
          case _: Throwable => println("Could not connect to database")
        }
        sc.stop()
      }

      def processResultSet(rs: ResultSet) {
        val rsmd = rs.getMetaData()
        val numberOfColumns = rsmd.getColumnCount()
        var i = 1
        while (i <= numberOfColumns) {
          val colName = rsmd.getColumnName(i)
          val tableName = rsmd.getTableName(i)
          val name = rsmd.getColumnTypeName(i)
          val caseSen = rsmd.isCaseSensitive(i)
          val writable = rsmd.isWritable(i)
          println("Information for column " + colName)
          println("  Column is in table " + tableName)
          println("  Column type is " + name)
          println()
          i += 1
        }
        while (rs.next()) {
          var i = 1
          while (i <= numberOfColumns) {
            System.out.print(rs.getString(i) + " ")
            i += 1
          }
          println()
        }
      }
    }

5. build SparkToJDBC.scala:

    sbt package
6. execute the Spark job. Note: don't forget to add the Phoenix jar using the --jars option, like this:

    ../spark-1.1.0/bin/spark-submit --jars ../phoenix-3.1.0-bin/hadoop2/phoenix-3.1.0-client-hadoop2.jar --class SparkToJDBC --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar

regards, Alex
Spark SQL with Apache Phoenix lower and upper Bound
I want to run queries on Apache Phoenix, which has a JDBC driver. The query that I want to run is:

    select ts,ename from random_data_date limit 10

But I'm having issues with the JdbcRDD upperBound and lowerBound parameters (which I don't actually understand). Here's what I have so far:

    import org.apache.spark.rdd.JdbcRDD
    import java.sql.{Connection, DriverManager, ResultSet}

    val url = "jdbc:phoenix:zookeeper"
    val sql = "select ts,ename from random_data_date limit ?"
    val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url), sql, 5, 10, 2,
      r => r.getString("ts") + ", " + r.getString("ename"))

But this doesn't work, because the sql expression that the JdbcRDD expects has to have two ?s to represent the lower and upper bound. How can I run my query through the JdbcRDD? Regards, Alaa Ali
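For reference, a minimal sketch of the shape JdbcRDD expects: the two ?s must form a numeric range predicate that the lower and upper bounds get substituted into (the id column here is hypothetical; it is exactly the kind of numeric column the sequence trick in the thread above provides):

    // JdbcRDD fills the two ?s with sub-ranges of [1, 100], split across 2 partitions
    val sql = "select ts, ename from random_data_date where id >= ? and id <= ?"
    val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url), sql, 1, 100, 2,
      r => r.getString("ts") + ", " + r.getString("ename"))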
Re: Spark SQL with Apache Phoenix lower and upper Bound
Awesome, thanks Josh, I missed that previous post of yours! But your code snippet shows a select statement, so what I can do is just run a simple select with a where clause if I want to, and then run my data processing on the RDD to mimic the aggregation I want to do with SQL, right? Also, another question: I still haven't tried this out, but I'll actually be using this with PySpark, so I'm guessing the PhoenixPigConfiguration and newHadoopRDD can be defined in PySpark as well? Regards, Alaa Ali

On Fri, Nov 21, 2014 at 4:34 PM, Josh Mahonin <jmaho...@interset.com> wrote:

Hi Alaa Ali, In order for Spark to split the JDBC query in parallel, it expects an upper and lower bound for your input data, as well as a number of partitions so that it can split the query across multiple tasks. For example, depending on your data distribution, you could set an upper and lower bound on your timestamp range, and Spark should be able to create new sub-queries to split up the data. Another option is to load up the whole table using the PhoenixInputFormat as a NewHadoopRDD. It doesn't yet support many of Phoenix's aggregate functions, but it does let you load up whole tables as RDDs. I've previously posted example code here: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAJ6CGtA1DoTdadRtT5M0+75rXTyQgu5gexT+uLccw_8Ppzyt=q...@mail.gmail.com%3E There's also an example library implementation here, although I haven't had a chance to test it yet: https://github.com/simplymeasured/phoenix-spark Josh
Re: pyspark get column family and qualifier names from hbase table
Hey freedafeng, I'm exactly where you are. I want the output to show the rowkey and all column qualifiers that correspond to it. How did you write HBaseResultToStringConverter to do what you wanted it to do?
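In case it helps later readers, a sketch of such a converter in the style of the HBaseConverters bundled with the Spark examples (it is registered through the valueConverter argument of sc.newAPIHadoopRDD on the PySpark side; CellUtil assumes HBase 0.96+, and the "rowkey cf:qualifier=value ..." output format is made up for illustration):

    import org.apache.hadoop.hbase.CellUtil
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.api.python.Converter

    class HBaseResultToStringConverter extends Converter[Any, String] {
      override def convert(obj: Any): String = {
        val result = obj.asInstanceOf[Result]
        val rowKey = Bytes.toStringBinary(result.getRow)
        // one "family:qualifier=value" token per cell in the row
        val cells = result.rawCells().map { cell =>
          Bytes.toStringBinary(CellUtil.cloneFamily(cell)) + ":" +
            Bytes.toStringBinary(CellUtil.cloneQualifier(cell)) + "=" +
            Bytes.toStringBinary(CellUtil.cloneValue(cell))
        }
        rowKey + " " + cells.mkString(" ")
      }
    }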