The parameter spark.yarn.executor.memoryOverhead
Hi Gurus, The parameter spark.yarn.executor.memoryOverhead is documented as follows: default executorMemory * 0.10, with a minimum of 384. It is the amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc., and it tends to grow with the executor size (typically 6-10%). So does that mean that for a 10GB executor this should ideally be set to ~10% = 1GB? What would happen if we set it higher, say to 30% ~ 3GB? What is this memory exactly used for (as opposed to the memory allocated to the executor itself)? Thanking you
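For reference, a minimal sketch of setting the overhead explicitly (the figures are only illustrative; spark.yarn.executor.memoryOverhead is the property name used on YARN in the Spark releases this thread refers to):

import org.apache.spark.{SparkConf, SparkContext}

// hypothetical sizing: a 10 GB executor heap with ~10% (1024 MB) off-heap overhead
val conf = new SparkConf()
  .setAppName("OverheadExample")
  .set("spark.executor.memory", "10g")
  .set("spark.yarn.executor.memoryOverhead", "1024")   // value is in MB

val sc = new SparkContext(conf)

The same value can be passed on the command line with --conf spark.yarn.executor.memoryOverhead=1024 at spark-submit time.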
how does spark handle compressed files
Hi, How does Spark handle compressed files? Can they be split and processed in parallel when building an RDD against the file, or does one need to uncompress them beforehand, say for bz2-type files? thanks
RDD and DataFrame persistent memory usage
Gurus, I understand that when we create an RDD in Spark it is immutable. So I have a few points please: - When an RDD is created, is that just a pointer? Most Spark operations are lazy, so nothing is consumed until an action (a collect-type operation) is run against the RDD? - When a DF is created from an RDD, does that result in additional memory for the DF? Again, does an action affect both the RDD and the DF built from that RDD? - There are some references saying that as you build operations and create new DFs, one consumes more and more memory without releasing it back? - What will happen if I do df.unpersist? My understanding is that it moves the DF out of memory (cache). Will that reduce memory overhead? - Is it a good idea to unpersist to reduce memory overhead? Thanking you
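For illustration, a minimal sketch of caching and unpersisting (assuming the spark-shell's sc and sqlContext; the data is made up). Worth noting: unpersist() removes the cached blocks altogether rather than moving them to disk; it is persist(StorageLevel.MEMORY_AND_DISK) that allows spilling to disk when memory runs short.

import sqlContext.implicits._
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000).map(i => (i, s"name_$i"))   // lazy: nothing computed yet
val df  = rdd.toDF("id", "name")                                   // still lazy, no extra copy yet

df.persist(StorageLevel.MEMORY_ONLY)   // only marks the DF for caching
df.count()                             // the first action materialises and caches it
df.unpersist()                         // drops the cached blocks and frees that memory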
Re: how many topics spark streaming can handle
thank you, in the following example:

val topics = "test1,test2,test3"
val brokers = "localhost:9092"
val topicsSet = topics.split(",").toSet
val sparkConf = new SparkConf().setAppName("KafkaDroneCalc").setMaster("local") // spark://localhost:7077
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(30))
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

is it possible to have three topics, or many more? On Monday, 19 June 2017, 20:10, Michael Armbrust wrote: I don't think that there is really a Spark-specific limit here. It would be a function of the size of your Spark/Kafka clusters and the type of processing you are trying to do. On Mon, Jun 19, 2017 at 12:00 PM, Ashok Kumar wrote: Hi Gurus, Within one Spark streaming process how many topics can be handled? I have not tried more than one topic. Thanks
how many topics spark streaming can handle
Hi Gurus, Within one Spark streaming process how many topics can be handled? I have not tried more than one topic. Thanks
Re: Edge Node in Spark
Just straight Spark please. Also, if I run a Spark job using Python or Scala on YARN, where are the log files kept on the edge node? Are these under the logs directory for YARN? thanks On Tuesday, 6 June 2017, 14:11, Irving Duran wrote: Ashok, are you working with straight Spark or referring to GraphX? Thank You, Irving Duran On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar wrote: Hi, I am a bit confused between edge node, edge server and gateway node in Spark. Do these mean the same thing? How does one set up an edge node to be used in Spark? Is this different from an edge node for Hadoop please? Thanks
Edge Node in Spark
Hi, I am a bit confused between edge node, edge server and gateway node in Spark. Do these mean the same thing? How does one set up an edge node to be used in Spark? Is this different from an edge node for Hadoop please? Thanks
Re: High Availability/DR options for Spark applications
Hi, High Availability means that the system, including Spark, will carry on with minimal disruption in case of an active component failure. DR or disaster recovery means total fail-over to another location with its own nodes, HDFS and Spark cluster. Thanks On Sunday, 5 February 2017, 20:15, Jacek Laskowski wrote: Hi, I'm not very familiar with "High Availability/DR operations". Could you explain what it is? My very limited understanding of the phrase allows me to think that with YARN and cluster deploy mode you have failure recovery for free, so when your driver dies YARN will attempt to resurrect it a few times. The other "components", i.e. map shuffle stages, partitions/tasks, are handled by Spark itself. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Feb 5, 2017 at 10:11 AM, Ashok Kumar wrote: > Hello, > > What are the practised High Availability/DR operations for a Spark cluster at > the moment? I am especially interested if YARN is used as the resource > manager. > > Thanks
High Availability/DR options for Spark applications
Hello, What are the commonly practised High Availability/DR operations for a Spark cluster at the moment? I am especially interested in the case where YARN is used as the resource manager. Thanks
Re: Happy Diwali to those forum members who celebrate this great festival
You are very kind Sir On Sunday, 30 October 2016, 16:42, Devopam Mittra wrote: +1 Thanks and regards Devopam On 30 Oct 2016 9:37 pm, "Mich Talebzadeh" wrote: Enjoy the festive season. Regards, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Pros and cons of using different persistence layers for Spark
What are the pros and cons of using different persistence layers for Spark, such as S3, Cassandra, and HDFS? Thanks
Re: Design considerations for batch and speed layers
Can one design a fast pipeline with Kafka, Spark Streaming and HBase or something similar? On Friday, 30 September 2016, 17:17, Mich Talebzadeh wrote: I have designed this prototype for a risk business. Here I would like to discuss issues with the batch layer. Apologies about being long winded. Business objective: Reduce risk in the credit business while making better credit and trading decisions. Specifically, to identify risk trends within certain years of trading data. For example, measure the risk exposure in a given portfolio by industry, region, credit rating and other parameters. At the macroscopic level, analyze data across market sectors, over a given time horizon, to assess risk changes. Deliverable: Enable real-time and batch analysis of risk data. Batch technology stack used: Kafka -> ZooKeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query tool, Zeppelin. Test volumes for POC: 1 message queue (csv format), 100 stock prices streaming in every 2 seconds, 180K prices per hour, 4 million+ per day. - Prices to Kafka -> ZooKeeper -> Flume -> HDFS - HDFS daily partition for that day's data - Hive external table looking at the HDFS partitioned location - Hive managed table populated every 15 minutes via cron from the Hive external table (table type ORC, partitioned by date). This is purely a Hive job. The Hive table is populated using insert/overwrite for that day to avoid boundary values/missing data etc. - Typical batch ingestion time (Hive table populated from HDFS files) ~ 2 minutes - Data in the Hive table has 15 minutes latency - Zeppelin to be used as the UI with Spark. Zeppelin will use Spark SQL (on the Spark Thrift Server) and the Spark shell. Within the Spark shell, users can access batch tables in Hive, or they have a choice of accessing the raw data on HDFS files, which gives them real-time access (not to be confused with the speed layer). Using a typical query with Spark, to see the last 15 minutes of real-time data (T-15 to now) takes 1 minute. Running the same query (my typical query, not a user query) on the Hive tables, this time using Spark, takes 6 seconds. However, there are some design concerns: - Zeppelin starts slowing down by the end of the day. Sometimes it throws a broken pipe message. I resolve this by restarting the Zeppelin daemon. Potential show stopper. - As the volume of data increases throughout the day, performance becomes an issue. - Every 15 minutes when the cron starts, Hive insert/overwrites can potentially get in conflict with users throwing queries from Zeppelin/Spark. I am sure that with exclusive writes, Hive will block all users from accessing these tables (at partition level) until the insert overwrite is done. This can be improved by better partitioning of the Hive tables, or by relaxing the ingestion interval to half an hour or one hour at the cost of more lag. I tried Parquet tables in Hive but saw really no difference in performance. I have thought of replacing Hive with HBase etc., but that brings new complications in as well without necessarily solving the issue. - I am not convinced this design can scale up easily with 5 times more volume of data. - We will also get real-time data from RDBMS tables (Oracle, Sybase, MSSQL) using replication technologies such as SAP Replication Server. These currently deliver changed log data to Hive tables, so there is some compatibility issue here. So I am sure some members can add useful ideas :) Thanks Mich LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
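To make the batch-layer access path concrete, a minimal sketch of the kind of Spark query described above (the database, table and column names are invented; a Spark 1.6-style HiveContext is assumed):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc from spark-shell / Zeppelin

// hypothetical ORC-backed Hive table, partitioned by trade_date
val todaysRisk = hiveContext.sql(
  """SELECT ticker, AVG(price) AS avg_price
    |FROM marketdata.prices
    |WHERE trade_date = '2016-09-30'
    |GROUP BY ticker""".stripMargin)

todaysRisk.show(10)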
Spark Interview questions
Hi, As a learner, I would appreciate it if you could forward me typical Spark interview questions for junior Spark/Scala roles. I will be very obliged
Re: dstream.foreachRDD iteration
I have checked that doc sir. My understanding is that every batch interval of data always generates one RDD, so why do we need to use foreachRDD when there is only one? Sorry for this question but it is a bit confusing to me. Thanks On Wednesday, 7 September 2016, 18:05, Mich Talebzadeh wrote: Hi, What is so confusing about RDD? Have you checked this doc? http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 7 September 2016 at 11:39, Ashok Kumar wrote: Hi, A bit confusing to me: how many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once? I mean DStream.foreachRDD { rdd => } I am trying to get the individual lines in the RDD. Thanks
dstream.foreachRDD iteration
Hi, A bit confusing to me: how many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once? I mean DStream.foreachRDD { rdd => } I am trying to get the individual lines in the RDD. Thanks
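For what it is worth, a minimal sketch of the usual pattern (lines is an assumed DStream[String]; foreachRDD runs once per batch interval, and the inner loop is over the records inside that batch's single RDD):

lines.foreachRDD { rdd =>
  // one RDD per batch interval
  rdd.foreach { line =>
    // one record (line) at a time within that RDD;
    // note this runs on the executors, so println goes to the executor logs
    println(line)
  }
}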
Re: Getting figures from spark streaming
Any help on this warmly appreciated. On Tuesday, 6 September 2016, 21:31, Ashok Kumar wrote: Hello Gurus, I am creating some figures, feeding them into Kafka and then Spark Streaming. It works OK but I have the following issue. For now, as a test, I send 5 prices in each batch interval. In the loop code this is what is happening:

dstream.foreachRDD { rdd =>
  val x = rdd.count
  i += 1
  println(s"> rdd loop i is ${i}, number of lines is ${x} <==")
  if (x > 0) {
    println(s"processing ${x} records=")
    var words1 = rdd.map(_._2).map(_.split(',').view(0)).map(_.toInt).collect.apply(0)
    println(words1)
    var words2 = rdd.map(_._2).map(_.split(',').view(1)).map(_.toString).collect.apply(0)
    println(words2)
    var price = rdd.map(_._2).map(_.split(',').view(2)).map(_.toFloat).collect.apply(0)
    println(price)
    rdd.collect.foreach(println)
  }
}

My tuple looks like this: // (null, "ID TIMESTAMP PRICE") // (null, "40,20160426-080924, 67.55738301621814598514") And this is the sample output from the run:

processing 5 records=
3
20160906-212509
80.224686
(null,3,20160906-212509,80.22468448052631637099)
(null,1,20160906-212509,60.40695324215582386153)
(null,4,20160906-212509,61.95159400693415572125)
(null,2,20160906-212509,93.05912099305473237788)
(null,5,20160906-212509,81.08637370113427387121)

Now it does process the first values 3, 20160906-212509, 80.224686 for record (null,3,20160906-212509,80.22468448052631637099) but ignores the rest of the 4 records. How can I make it go through all records here? I want the third column from all records! Greetings
Getting figures from spark streaming
Hello Gurus, I am creating some figures, feeding them into Kafka and then Spark Streaming. It works OK but I have the following issue. For now, as a test, I send 5 prices in each batch interval. In the loop code this is what is happening:

dstream.foreachRDD { rdd =>
  val x = rdd.count
  i += 1
  println(s"> rdd loop i is ${i}, number of lines is ${x} <==")
  if (x > 0) {
    println(s"processing ${x} records=")
    var words1 = rdd.map(_._2).map(_.split(',').view(0)).map(_.toInt).collect.apply(0)
    println(words1)
    var words2 = rdd.map(_._2).map(_.split(',').view(1)).map(_.toString).collect.apply(0)
    println(words2)
    var price = rdd.map(_._2).map(_.split(',').view(2)).map(_.toFloat).collect.apply(0)
    println(price)
    rdd.collect.foreach(println)
  }
}

My tuple looks like this: // (null, "ID TIMESTAMP PRICE") // (null, "40,20160426-080924, 67.55738301621814598514") And this is the sample output from the run:

processing 5 records=
3
20160906-212509
80.224686
(null,3,20160906-212509,80.22468448052631637099)
(null,1,20160906-212509,60.40695324215582386153)
(null,4,20160906-212509,61.95159400693415572125)
(null,2,20160906-212509,93.05912099305473237788)
(null,5,20160906-212509,81.08637370113427387121)

Now it does process the first values 3, 20160906-212509, 80.224686 for record (null,3,20160906-212509,80.22468448052631637099) but ignores the rest of the 4 records. How can I make it go through all records here? I want the third column from all records! Greetings
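A possible fix, as a rough sketch: .collect.apply(0) only ever takes the first element of the collected array, which is why the other four records are dropped. Mapping over the whole RDD keeps every record (the same dstream of (key, "id,timestamp,price") tuples is assumed):

dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // keep the third column (price) of every record in this batch
    val prices = rdd.map(_._2)                         // the CSV payload
                    .map(_.split(',')(2).trim.toFloat) // third field as a number
    prices.collect().foreach(println)                  // all records, not just the first
  }
}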
Re: Splitting columns from a text file
Thanks everyone. I am not skilled like you gentlemen. This is what I did: 1) Read the text file val textFile = sc.textFile("/tmp/myfile.txt") 2) That produces an RDD of String. 3) Create a DF after splitting the file into an Array val df = textFile.map(line => line.split(",")).map(x => (x(0).toInt, x(1).toString, x(2).toDouble)).toDF 4) Create a class for column headers case class Columns(col1: Int, col2: String, col3: Double) 5) Assign the column headers val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString, p(2).toString.toDouble)) 6) Only interested in column 3 > 50 h.filter(col("Col3") > 50.0) 7) Now I just want col3 only h.filter(col("Col3") > 50.0).select("col3").show(5)

+-----------------+
|             col3|
+-----------------+
|95.42536350467836|
|61.56297588648554|
|76.73982017179868|
|68.86218120274728|
|67.64613810115105|
+-----------------+
only showing top 5 rows

Does that make sense? Are there shorter ways, gurus? Can I just do all this on the RDD without a DF? Thanking you On Monday, 5 September 2016, 15:19, ayan guha wrote: Then, you need to refer to the third term in the array, convert it to your desired data type and then use filter. On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar wrote: Hi, I want to filter them for values. This is what is in the array: 74,20160905-133143,98.11218069128827594148 I want to filter anything > 50.0 in the third column Thanks On Monday, 5 September 2016, 15:07, ayan guha wrote: Hi, x.split returns an array. So, after the first map, you will get an RDD of arrays. What is your expected outcome of the 2nd map? On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote: Thank you sir. This is what I get scala> textFile.map(x => x.split(",")) res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at :27 How can I work on individual columns? I understand they are strings scala> textFile.map(x => x.split(",")).map(x => (x.getString(0)) | ) :27: error: value getString is not a member of Array[String] textFile.map(x => x.split(",")).map(x => (x.getString(0)) regards On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote: Basic error, you get back an RDD on transformations like map. sc.textFile("filename").map(x => x.split(",")) On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote: Hi, I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.6368952654440752
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt") Now I want to split the rows separated by "," scala> textFile.map(x => x.toString).split(",") :27: error: value split is not a member of org.apache.spark.rdd.RDD[String] textFile.map(x => x.toString).split(",") However, the above throws an error. Any ideas what is wrong, or how I can do this if I can avoid converting it to String? Thanking -- Best Regards, Ayan Guha -- Best Regards, Ayan Guha
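On the last question, a rough sketch of the same thing on the RDD alone, without a DataFrame (same /tmp/myfile.txt layout assumed):

val textFile = sc.textFile("/tmp/myfile.txt")

// split each line, keep rows whose third column is > 50.0, then project that column
val col3 = textFile.map(_.split(","))
                   .filter(_(2).toDouble > 50.0)
                   .map(_(2).toDouble)

col3.take(5).foreach(println)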
Re: Splitting columns from a text file
Hi, I want to filter them for values. This is what is in the array: 74,20160905-133143,98.11218069128827594148 I want to filter anything > 50.0 in the third column Thanks On Monday, 5 September 2016, 15:07, ayan guha wrote: Hi, x.split returns an array. So, after the first map, you will get an RDD of arrays. What is your expected outcome of the 2nd map? On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar wrote: Thank you sir. This is what I get scala> textFile.map(x => x.split(",")) res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at :27 How can I work on individual columns? I understand they are strings scala> textFile.map(x => x.split(",")).map(x => (x.getString(0)) | ) :27: error: value getString is not a member of Array[String] textFile.map(x => x.split(",")).map(x => (x.getString(0)) regards On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote: Basic error, you get back an RDD on transformations like map. sc.textFile("filename").map(x => x.split(",")) On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote: Hi, I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.6368952654440752
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt") Now I want to split the rows separated by "," scala> textFile.map(x => x.toString).split(",") :27: error: value split is not a member of org.apache.spark.rdd.RDD[String] textFile.map(x => x.toString).split(",") However, the above throws an error. Any ideas what is wrong, or how I can do this if I can avoid converting it to String? Thanking -- Best Regards, Ayan Guha
Re: Splitting columns from a text file
Thank you sir. This is what I get scala> textFile.map(x => x.split(",")) res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at :27 How can I work on individual columns? I understand they are strings scala> textFile.map(x => x.split(",")).map(x => (x.getString(0)) | ) :27: error: value getString is not a member of Array[String] textFile.map(x => x.split(",")).map(x => (x.getString(0)) regards On Monday, 5 September 2016, 13:51, Somasundaram Sekar wrote: Basic error, you get back an RDD on transformations like map. sc.textFile("filename").map(x => x.split(",")) On 5 Sep 2016 6:19 pm, "Ashok Kumar" wrote: Hi, I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.6368952654440752
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt") Now I want to split the rows separated by "," scala> textFile.map(x => x.toString).split(",") :27: error: value split is not a member of org.apache.spark.rdd.RDD[String] textFile.map(x => x.toString).split(",") However, the above throws an error. Any ideas what is wrong, or how I can do this if I can avoid converting it to String? Thanking
Splitting columns from a text file
Hi, I have a text file as below that I read in:

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.6368952654440752
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt") Now I want to split the rows separated by "," scala> textFile.map(x => x.toString).split(",") :27: error: value split is not a member of org.apache.spark.rdd.RDD[String] textFile.map(x => x.toString).split(",") However, the above throws an error. Any ideas what is wrong, or how I can do this if I can avoid converting it to String? Thanking
Difference between Data set and Data Frame in Spark 2
Hi, What are the practical differences between the new Dataset in Spark 2 and the existing DataFrame? Has Dataset replaced DataFrame, and what advantages does it have if I use Dataset instead of DataFrame? Thanks
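For illustration, a minimal sketch of the distinction (assuming a Spark 2.x session such as the spark-shell's spark; the Person case class is made up). A DataFrame is an untyped Dataset[Row], while a Dataset[Person] keeps the element type and lets you use ordinary lambdas that are checked at compile time:

case class Person(name: String, age: Int)

import spark.implicits._

val df = Seq(Person("Alice", 30), Person("Bob", 25)).toDF()   // DataFrame = Dataset[Row]
val ds = df.as[Person]                                        // typed Dataset[Person]

df.filter($"age" > 26).show()   // untyped column expression, checked only at runtime
ds.filter(_.age > 26).show()    // typed lambda, checked at compile time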
Design patterns involving Spark
Hi, There are design patterns that use Spark extensively. I am new to this area, so I would appreciate it if someone could explain where Spark fits in, especially within the fast or streaming use case. What are the best practices involving Spark? Is it always best to deploy it as the processing engine? For example, when we have a pattern Input Data -> Data in Motion -> Processing -> Storage, where does Spark best fit in? Thanking you
Spark standalone or Yarn for resourcing
Hi, For small to medium size clusters I think Spark standalone mode is a good choice. We are contemplating moving to YARN as our cluster grows. What are the pros and cons of using either, please? Which one offers the best resource management? Thanking you
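For reference, the choice mostly shows up in the master URL handed to the application; a minimal sketch (host names are placeholders, and on older releases "yarn-client"/"yarn-cluster" take the place of "yarn" plus a deploy mode):

import org.apache.spark.SparkConf

// Standalone: the cluster manager shipped with Spark; the master address is given explicitly.
val standaloneConf = new SparkConf()
  .setAppName("MyJob")
  .setMaster("spark://master-host:7077")

// YARN: the ResourceManager address is picked up from the Hadoop configuration,
// so only "yarn" is specified.
val yarnConf = new SparkConf()
  .setAppName("MyJob")
  .setMaster("yarn")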
Re: parallel processing with JDBC
Thank you very much sir. I forgot to mention that two of these Oracle tables are range partitioned. In that case what would be the optimum number of partitions, if you can share? Warmest On Sunday, 14 August 2016, 21:37, Mich Talebzadeh wrote: If you have primary keys on these tables then you can parallelise the process of reading the data. You have to be careful not to set the number of partitions too high. Certainly there is a balance between the number of partitions supplied to JDBC and the load on the network and the source DB. Assuming that your underlying table has primary key ID, then this will create 20 parallel processes to the Oracle DB:

val d = HiveContext.read.format("jdbc").options(
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT , , FROM )",
      "partitionColumn" -> "ID",
      "lowerBound" -> "1",
      "upperBound" -> "maxID",
      "numPartitions" -> "20",
      "user" -> _username,
      "password" -> _password)).load

assuming your upper bound on ID is maxID. This will open multiple connections to the RDBMS, each getting a subset of the data that you want. You need to test it to ensure that you get the optimum numPartitions and you don't overload any component. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 14 August 2016 at 21:15, Ashok Kumar wrote: Hi, There are 4 tables ranging from 10 million to 100 million rows but they all have primary keys. The network is fine but our Oracle is RAC and we can only connect to a designated Oracle node (where we have a DQ account only). We have a limited time window of a few hours to get the required data out. Thanks On Sunday, 14 August 2016, 21:07, Mich Talebzadeh wrote: How big are your tables and is there any issue with the network between your Spark nodes and your Oracle DB that adds to the issues? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 14 August 2016 at 20:50, Ashok Kumar wrote: Hi Gurus, I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting the data into a Spark DataFrame, say with multiple connections from Spark? thanking you
Re: parallel processing with JDBC
Hi, There are 4 tables ranging from 10 million to 100 million rows but they all have primary keys. The network is fine but our Oracle is RAC and we can only connect to a designated Oracle node (where we have a DQ account only). We have a limited time window of a few hours to get the required data out. Thanks On Sunday, 14 August 2016, 21:07, Mich Talebzadeh wrote: How big are your tables and is there any issue with the network between your Spark nodes and your Oracle DB that adds to the issues? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 14 August 2016 at 20:50, Ashok Kumar wrote: Hi Gurus, I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting the data into a Spark DataFrame, say with multiple connections from Spark? thanking you
parallel processing with JDBC
Hi Gurus, I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting the data into a Spark DataFrame, say with multiple connections from Spark? thanking you
num-executors, executor-memory and executor-cores parameters
Hi, I would like to know the exact definition of these three parameters: num-executors, executor-memory and executor-cores, for local, standalone and YARN modes. I have looked at the online doc but I am not convinced I understand them correctly. Thanking you
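For context, those spark-submit flags map onto ordinary configuration properties, so a minimal sketch of setting the same values programmatically would be as below (the numbers are only illustrative; num-executors / spark.executor.instances only applies on YARN, not in local or standalone mode):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ResourceExample")
  .set("spark.executor.instances", "4")   // --num-executors: number of executors (YARN)
  .set("spark.executor.memory", "4g")     // --executor-memory: heap size per executor JVM
  .set("spark.executor.cores", "2")       // --executor-cores: concurrent task slots per executor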
Windows operation orderBy desc
Hi, In the following Window spec I want the orderBy column to be sorted in descending order please: val W = Window.partitionBy("col1").orderBy("col2") If I do val W = Window.partitionBy("col1").orderBy("col2".desc) it throws an error: console>:26: error: value desc is not a member of String How can I achieve that? Thanking you
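In case it helps, a small sketch of the usual way around this: .desc is defined on a Column, not on a String, so wrap the column name in col() (or use the desc() function):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc}

val W  = Window.partitionBy("col1").orderBy(col("col2").desc)
// equivalently
val W2 = Window.partitionBy("col1").orderBy(desc("col2"))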
The main difference use case between orderBY and sort
Hi, In Spark programming I can use

df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total Debit Card")).orderBy("transactiondate").show(5)

or

df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total Debit Card")).sort("transactiondate").show(5)

I get the same results, and I can use both as well:

df.filter(col("transactiontype") === "DEB").groupBy("transactiondate").agg(sum("debitamount").cast("Float").as("Total Debit Card")).orderBy("transactiondate").sort("transactiondate").show(5)

but the last one takes more time. What is the use case for each of these please? Does it make sense to use both? Thanks
Re: Presentation in London: Running Spark on Hive or Hive on Spark
Thanks Mich, looking forward to it :) On Tuesday, 19 July 2016, 19:13, Mich Talebzadeh wrote: Hi all, This will be in London tomorrow, Wednesday 20th July, starting at 18:00 for refreshments and kick-off at 18:30, 5 minutes' walk from Canary Wharf Station, Jubilee Line. If you wish you can register and get more info here. It will be in La Tasca, West India Docks Road, E14, and especially if you like Spanish food :) Regards, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 15 July 2016 at 11:06, Joaquin Alzola wrote: It is on the 20th (Wednesday) next week. From: Marco Mistroni [mailto:mmistr...@gmail.com] Sent: 15 July 2016 11:04 To: Mich Talebzadeh Cc: user @spark ; user Subject: Re: Presentation in London: Running Spark on Hive or Hive on Spark Dr Mich, do you have any slides or videos available for the presentation you did at Canary Wharf? Kindest regards, Marco On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh wrote: Dear forum members, I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, your mileage varies" in Future of Data: London. Details: Organized by Hortonworks. Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM. Place: London. Location: One Canada Square, Canary Wharf, London E14 5AB. Nearest Underground: Canary Wharf (map). If you are interested please register here. Looking forward to seeing those who can make it to have an interesting discussion and leverage your experience. Regards, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Re: Fast database with writes per second and horizontal scaling
Anyone in Spark as well? My colleague has been using Cassandra; however, he says it is too slow and not user friendly. MongoDB as a document database is pretty neat but not fast enough. My main concern is fast writes per second and good scaling. Hive on Spark or Tez? How about HBase, or anything else? Any expert advice warmly acknowledged. thanking you On Monday, 11 July 2016, 17:24, Ashok Kumar wrote: Hi Gurus, Advice appreciated from Hive gurus. My colleague has been using Cassandra; however, he says it is too slow and not user friendly. MongoDB as a document database is pretty neat but not fast enough. My main concern is fast writes per second and good scaling. Hive on Spark or Tez? How about HBase, or anything else? Any expert advice warmly acknowledged. thanking you
Re: Using Spark on Hive with Hive also using Spark as its execution engine
Hi Mich, Your recent presentation in London was on this topic, "Running Spark on Hive or Hive on Spark". Have you made any more interesting findings that you would like to bring up? If Hive is offering both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't get why TEZ + LLAP is going to be a better choice from what you mentioned? thanking you On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh wrote: A couple of points if I may, and kindly bear with my remarks, whilst it will be very interesting to try TEZ with LLAP. As I read, LLAP is described as follows: "Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up on the scale and flexibility that users depend on. This requires a new approach using a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process, #llap online). LLAP is an optional daemon process running on multiple nodes, that provides the following: - Caching and data reuse across queries with compressed columnar data in-memory (off-heap) - Multi-threaded execution including reads with predicate pushdown and hash joins - High throughput IO using Async IO Elevator with dedicated thread and core per disk - Granular column level security across applications" OK, so we have added an in-memory capability to TEZ by way of LLAP; in other words, what Spark does already, and BTW it does not require a daemon running on any host. Don't get me wrong, it is interesting, but this sounds to me (without testing it myself) like adding a caching capability to TEZ to bring it on par with Spark. Remember: Spark -> DAG + in-memory caching; TEZ = MR on DAG; TEZ + LLAP => DAG + in-memory caching. OK, it is another way of getting the same result. However, my concerns: - Spark has a wide user base. I judge this from the Spark user group traffic - The TEZ user group has no traffic I am afraid - LLAP I don't know Sounds like Hortonworks promote TEZ and Cloudera does not want to know anything about Hive, and they promote Impala, but that sounds like a sinking ship these days. Having said that, I will try TEZ + LLAP :) No pun intended Regards Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 31 May 2016 at 08:19, Jörn Franke wrote: Thanks, very interesting explanation. Looking forward to testing it. > On 31 May 2016, at 07:51, Gopal Vijayaraghavan wrote: > > >> That being said all systems are evolving. Hive supports tez+llap which >> is basically the in-memory support. > > There is a big difference between where LLAP & SparkSQL, which has to do > with access pattern needs. > > The first one is related to the lifetime of the cache - the Spark RDD > cache is per-user-session which allows for further operation in that > session to be optimized. > > LLAP is designed to be hammered by multiple user sessions running > different queries, designed to automate the cache eviction & selection > process. There's no user visible explicit .cache() to remember - it's > automatic and concurrent. > > My team works with both engines, trying to improve it for ORC, but the > goals of both are different. > > I will probably have to write a proper academic paper & get it > edited/reviewed instead of sending my ramblings to the user lists like this. > Still, this needs an example to talk about.
> > To give a qualified example, let's leave the world of single use clusters > and take the use-case detailed here > > http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/ > > > There are two distinct problems there - one is that a single day sees up to > 100k independent user sessions running queries and that most queries cover > the last hour (& possibly join/compare against a similar hour aggregate > from the past). > > The problem with having independent 100k user-sessions from different > connections was that the SparkSQL layer drops the RDD lineage & cache > whenever a user ends a session. > > The scale problem in general for Impala was that even though the data size > was in multiple terabytes, the actual hot data was approx <20Gb, which > resides on <10 machines with locality. > > The same problem applies when you apply RDD caching with something > un-replicated like Tachyon/Alluxio, since the same RDD will be exceedingly > popular, such that the machines which hold those blocks run extra hot. > > A cache model per-user session is entirely wasteful and a common cache + > MPP model effectively overloads 2-3% of the cluster, while leaving the other > machines idle. > > LLAP was designed specifically to prevent that hotspotting, while > maintaining the common cache model - within a few minutes after an hour > ticks over, the whole cluster develops temporal popularity for the hot > data and nearly every rack has at least one cached copy of the same data > for availability/performance. > > Since data streams tend to be extr
Re: Spark as sql engine on S3
Hi, As I said we have been using Hive as our SQL engine for the datasets, but we are storing the data externally in Amazon S3. Now you suggested the Spark Thrift Server. I started the Spark Thrift Server on port 10001 and I have used beeline, which accesses the thrift server:

Connecting to jdbc:hive2://<host>:10001
Connected to: Spark SQL (version 1.6.1)
Driver: Spark Project Core (version 1.6.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.6.1 by Apache Hive

Now can I just access my external tables on S3 as I do on Hive, with beeline connected to the Hive thrift server? And the advantage is that using Spark SQL will be much faster? regards On Friday, 8 July 2016, 6:30, ayan guha wrote: Yes, it can. On Fri, Jul 8, 2016 at 3:03 PM, Ashok Kumar wrote: thanks, so basically the Spark Thrift Server runs on a port, much like the Hive server that beeline connects to over JDBC? Can the Spark Thrift Server access Hive tables? regards On Friday, 8 July 2016, 5:27, ayan guha wrote: Spark Thrift Server works as a JDBC server; you can connect to it from any JDBC tool like SQuirreL. On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar wrote: Hello gurus, We are storing data externally on Amazon S3. What is the optimum or best way to use Spark as a SQL engine to access the data on S3? Any info/write-up will be greatly appreciated. Regards -- Best Regards, Ayan Guha -- Best Regards, Ayan Guha
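As a rough sketch of the sort of thing being discussed (the bucket, path and table names are invented, and this assumes the s3a connector and a Spark 1.6-style HiveContext sharing the same metastore as the Spark Thrift Server):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// External table over CSV data held in S3; once registered in the shared metastore,
// beeline sessions connected to the Spark Thrift Server can query it too.
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS trades (id INT, ts STRING, price DOUBLE)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LOCATION 's3a://my-bucket/trades/'""".stripMargin)

hiveContext.sql("SELECT COUNT(*) FROM trades").show()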
Re: Spark as sql engine on S3
thanks, so basically the Spark Thrift Server runs on a port, much like the Hive server that beeline connects to over JDBC? Can the Spark Thrift Server access Hive tables? regards On Friday, 8 July 2016, 5:27, ayan guha wrote: Spark Thrift Server works as a JDBC server; you can connect to it from any JDBC tool like SQuirreL. On Fri, Jul 8, 2016 at 3:50 AM, Ashok Kumar wrote: Hello gurus, We are storing data externally on Amazon S3. What is the optimum or best way to use Spark as a SQL engine to access the data on S3? Any info/write-up will be greatly appreciated. Regards -- Best Regards, Ayan Guha
Re: Presentation in London: Running Spark on Hive or Hive on Spark
Thanks. Will this presentation be recorded as well? Regards On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh wrote: Dear forum members, I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, your mileage varies" in Future of Data: London. Details: Organized by Hortonworks. Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM. Place: London. Location: One Canada Square, Canary Wharf, London E14 5AB. Nearest Underground: Canary Wharf (map). If you are interested please register here. Looking forward to seeing those who can make it to have an interesting discussion and leverage your experience. Regards, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Spark as sql engine on S3
Hello gurus, We are storing data externally on Amazon S3. What is the optimum or best way to use Spark as a SQL engine to access the data on S3? Any info/write-up will be greatly appreciated. Regards
ORC or parquet with Spark
With Spark caching, which file format is better to use, Parquet or ORC? Obviously ORC can be used with Hive. My question is whether Spark can use the file, stripe and row-group statistics stored in an ORC file? Otherwise, to me both Parquet and ORC are simply files kept on HDFS; they do not offer any caching to be faster. So if Spark ignores the underlying stats for ORC files, does it matter which file format to use with Spark? Thanks
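For reference, a minimal sketch of writing and reading both formats from Spark (a Spark 1.6-style sqlContext with Hive support is assumed for the ORC source, and df and the paths are made up); whether the embedded statistics are actually used for skipping is exactly the open question above and depends on the data source and filter-pushdown settings:

// write the same DataFrame in both formats
df.write.format("parquet").save("hdfs:///tmp/trades_parquet")
df.write.format("orc").save("hdfs:///tmp/trades_orc")

// read back with a filter; pushdown into the file/stripe statistics is governed by
// spark.sql.parquet.filterPushdown and spark.sql.orc.filterPushdown respectively
val p = sqlContext.read.format("parquet").load("hdfs:///tmp/trades_parquet").filter("price > 50")
val o = sqlContext.read.format("orc").load("hdfs:///tmp/trades_orc").filter("price > 50")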
latest version of Spark to work OK as Hive engine
Hi, Looking at this presentation, "Hive on Spark is Blazing Fast"... Which is the latest version of Spark that can run as an engine for Hive, please? Thanks P.S. I am aware of Hive on TEZ, but that is not what I am interested in here. Warmest regards
JDBC load into tempTable
Hi, I have a SQL Server table with 500,000,000 rows with a primary key (unique clustered index) on the ID column. If I load it through JDBC into a DataFrame and register it via registerTempTable, will the data be in ID order in the temp table? Thanks
Re: Running Spark in local mode
Thank you all sirs. Appreciated, Mich, your clarification. On Sunday, 19 June 2016, 19:31, Mich Talebzadeh wrote: Thanks Jonathan for your points. I am aware of the fact that yarn-client and yarn-cluster are both deprecated (they still work in 1.6.1), hence the new nomenclature. Bear in mind this is what I stated in my notes: "YARN Cluster Mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. This is invoked with --master yarn and --deploy-mode cluster - YARN Client Mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. - Unlike Spark standalone mode, in which the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn. This is invoked with --deploy-mode client" These are exactly from the Spark documentation, and I quote: "There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. Unlike Spark standalone and Mesos modes, in which the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn." Cheers Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 19 June 2016 at 19:09, Jonathan Kelly wrote: Mich, what Jacek is saying is not that you implied that YARN relies on two masters. He's just clarifying that yarn-client and yarn-cluster modes are really both using the same (type of) master (simply "yarn"). In fact, if you specify "--master yarn-client" or "--master yarn-cluster", spark-submit will translate that into using a master URL of "yarn" and a deploy-mode of "client" or "cluster". And thanks, Jacek, for the tips on the "less-common master URLs". I had no idea that was an option! ~ Jonathan On Sun, Jun 19, 2016 at 4:13 AM Mich Talebzadeh wrote: Good points, but I am an experimentalist. In local mode I have this: with --master local, this will start with one thread, equivalent to --master local[1]. You can also start with more than one thread by specifying the number of threads k in --master local[k]. You can also start using all available threads with --master local[*], which in my case would be local[12].
The important thing about local mode is that the number of JVMs spawned is controlled by you, and you can start as many spark-submit jobs as you wish within the constraints of what you have:

${SPARK_HOME}/bin/spark-submit \
  --packages com.databricks:spark-csv_2.11:1.3.0 \
  --driver-memory 2G \
  --num-executors 1 \
  --executor-memory 2G \
  --master local \
  --executor-cores 2 \
  --conf "spark.scheduler.mode=FIFO" \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
  --class "${FILE_NAME}" \
  --conf "spark.ui.port=4040" \
  ${JAR_FILE} \
  >> ${LOG_FILE}

Now that does work fine, although some of those parameters are implicit (for example scheduler.mode = FIFO or FAIR), and I can start different Spark jobs in local mode. Great for testing. With regard to your comments on standalone: "Spark Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster." s/simple/built-in. What is stated as "included" implies that, i.e. it comes as part of running Spark in standalone mode. On your other points on YARN cluster mode and YARN client mode: I'd say there's only one YARN master, i.e. --master yarn. You could however say where the driver runs, be it on your local machine where you executed spark-submit or on one node in a YARN cluster. Yes, that is I believe what the text implied. I would be very surprised if YARN as a resource manager relies on two masters :) HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 19 June 2016 at 11:46, Jacek Laskowski wrote: On Sun, Jun 19, 2016 at 12:30 PM, Mich Talebzadeh wrote: > Spark Local - Spark runs on the local host. This is the simplest set up and > best suited for learners who want to understand different concepts of Spark > and those performing unit testing. There are a
Re: Running Spark in local mode
thank you. What are the main differences between local mode and standalone mode? I understand local mode does not support a cluster. Is that the only difference? On Sunday, 19 June 2016, 9:52, Takeshi Yamamuro wrote: Hi, In local mode, Spark runs in a single JVM that has a master and one executor with `k` threads. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala#L94 // maropu On Sun, Jun 19, 2016 at 5:39 PM, Ashok Kumar wrote: Hi, I have been told Spark in local mode is simplest for testing. The Spark documentation covers little on local mode except the cores used in --master local[k]. Where are the driver program, executor and resources? Do I need to start worker threads, and how many apps can I run safely without exceeding the allocated memory, etc.? Thanking you -- --- Takeshi Yamamuro
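To make the local[k] notation concrete, a minimal sketch (the thread count is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// driver, master and the single executor all run inside this one JVM
val conf = new SparkConf()
  .setAppName("LocalModeExample")
  .setMaster("local[4]")     // 4 worker threads; local[*] uses all available cores

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())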
Running Spark in local mode
Hi, I have been told Spark in local mode is simplest for testing. The Spark documentation covers little on local mode except the cores used in --master local[k]. Where are the driver program, executor and resources? Do I need to start worker threads, and how many apps can I run safely without exceeding the allocated memory, etc.? Thanking you
Re: Running Spark in Standalone or local modes
Thanks Mich. Great explanation On Saturday, 11 June 2016, 22:35, Mich Talebzadeh wrote: Hi Gavin, I believe in standalone mode a simple cluster manager is included with Spark that makes it easy to set up a cluster. It does not rely on YARN or Mesos. In summary, this is from my notes: - Spark Local - Spark runs on the localhost. This is the simplest set up and best suited for learners who want to understand different concepts of Spark and those performing unit testing. - Spark Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster. - YARN Cluster Mode - the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. - Mesos - I have not used it so cannot comment. - YARN Client Mode - the driver runs in the client process, and the application master is only used for requesting resources from YARN. Unlike Local or Spark standalone modes, in which the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 11 June 2016 at 22:26, Gavin Yue wrote: Standalone mode is as opposed to YARN mode or Mesos mode, in which Spark uses YARN or Mesos for cluster management. Local mode is actually a standalone mode in which everything runs on a single local machine instead of remote clusters. That is my understanding. On Sat, Jun 11, 2016 at 12:40 PM, Ashok Kumar wrote: Thank you, grateful. I know I can start spark-shell by launching the shell itself: spark-shell Now I know that in standalone mode I can also connect to a master: spark-shell --master spark://:7077 My point is, what are the differences between these two start-up modes for spark-shell? If I start spark-shell and connect to the master, what performance gain will I get, if any, or does it not matter? Is it the same as for spark-submit? regards On Saturday, 11 June 2016, 19:39, Mohammad Tariq wrote: Hi Ashok, In local mode all the processes run inside a single JVM, whereas in standalone mode we have separate master and worker processes running in their own JVMs. To quickly test your code from within your IDE you could probably use the local mode. However, to get a real feel of how Spark operates I would suggest you have a standalone setup as well. It's just a matter of launching a standalone cluster either manually (by starting a master and workers by hand), or by using the launch scripts provided with the Spark package. You can find more on this here. HTH Tariq, Mohammad | about.me/mti On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar wrote: Hi, What is the difference between running Spark in local mode or standalone mode? Are they the same? If they are not, which is best suited for non-prod work? I am also aware that one can run Spark in YARN mode as well. Thanks
Re: Running Spark in Standalone or local modes
Thank you, grateful. I know I can start spark-shell by launching the shell itself: spark-shell Now I know that in standalone mode I can also connect to a master: spark-shell --master spark://:7077 My point is, what are the differences between these two start-up modes for spark-shell? If I start spark-shell and connect to the master, what performance gain will I get, if any, or does it not matter? Is it the same as for spark-submit? regards On Saturday, 11 June 2016, 19:39, Mohammad Tariq wrote: Hi Ashok, In local mode all the processes run inside a single JVM, whereas in standalone mode we have separate master and worker processes running in their own JVMs. To quickly test your code from within your IDE you could probably use the local mode. However, to get a real feel of how Spark operates I would suggest you have a standalone setup as well. It's just a matter of launching a standalone cluster either manually (by starting a master and workers by hand), or by using the launch scripts provided with the Spark package. You can find more on this here. HTH Tariq, Mohammad | about.me/mti On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar wrote: Hi, What is the difference between running Spark in local mode or standalone mode? Are they the same? If they are not, which is best suited for non-prod work? I am also aware that one can run Spark in YARN mode as well. Thanks
Running Spark in Standalone or local modes
Hi, What is the difference between running Spark in local mode and standalone mode? Are they the same? If they are not, which is best suited for non-prod work? I am also aware that one can run Spark in YARN mode as well. Thanks
Fw: Basic question on using one's own classes in the Scala app
Anyone can help me with this please? On Sunday, 5 June 2016, 11:06, Ashok Kumar wrote: Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and make references to these classes and methods in my other apps:

class getCheckpointDirectory {
  def CheckpointDirectory(ProgramName: String): String = {
    var hdfsDir = "hdfs://host:9000/user/user/checkpoint/" + ProgramName
    return hdfsDir
  }
}

I have used sbt to create a jar file for it. It is created as a jar file utilities-assembly-0.1-SNAPSHOT.jar Now I want to make a call to that method CheckpointDirectory in my app code myapp.scala to return the value for hdfsDir:

val ProgramName = this.getClass.getSimpleName.trim
val getCheckpointDirectory = new getCheckpointDirectory
val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)

However, I am getting a compilation error, as expected:

not found: type getCheckpointDirectory
[error] val getCheckpointDirectory = new getCheckpointDirectory
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

So a basic question: in order for compilation to work, do I need to create a package for my jar file, or add a dependency like the following, as I do in sbt?

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

Or add the jar file to $CLASSPATH? Any advice will be appreciated. Thanks
Re: Basic question on using one's own classes in the Scala app
Thank you. I added this as a dependency: libraryDependencies += "com.databricks" % "apps.twitter_classifier" % "1.0.0" I chose the number at the end arbitrarily; is that correct? Also in my TwitterAnalyzer.scala I added this line: import com.databricks.apps.twitter_classifier._ Now I am getting this error:

[info] Resolving com.databricks#apps.twitter_classifier;1.0.0 ...
[warn] module not found: com.databricks#apps.twitter_classifier;1.0.0
[warn] local: tried
[warn] /home/hduser/.ivy2/local/com.databricks/apps.twitter_classifier/1.0.0/ivys/ivy.xml
[warn] public: tried
[warn] https://repo1.maven.org/maven2/com/databricks/apps.twitter_classifier/1.0.0/apps.twitter_classifier-1.0.0.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn] ::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::
[warn] :: com.databricks#apps.twitter_classifier;1.0.0: not found
[warn] ::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] com.databricks:apps.twitter_classifier:1.0.0 (/home/hduser/scala/TwitterAnalyzer/build.sbt#L18-19)
[warn] +- scala:scala_2.10:1.0
sbt.ResolveException: unresolved dependency: com.databricks#apps.twitter_classifier;1.0.0: not found

Any ideas? regards, On Sunday, 5 June 2016, 22:22, Jacek Laskowski wrote: On Sun, Jun 5, 2016 at 9:01 PM, Ashok Kumar wrote: > Now I have added this > > libraryDependencies += "com.databricks" % "apps.twitter_classifier" > > However, I am getting an error > > > error: No implicit for Append.Value[Seq[sbt.ModuleID], > sbt.impl.GroupArtifactID] found, > so sbt.impl.GroupArtifactID cannot be appended to Seq[sbt.ModuleID] > libraryDependencies += "com.databricks" % "apps.twitter_classifier" > ^ > [error] Type error in expression > Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? Missing version element, e.g. libraryDependencies += "com.databricks" % "apps.twitter_classifier" % "VERSION_HERE" Jacek
Re: Basic question on using one's own classes in the Scala app
Hello, For 1, I read the doc as libraryDependencies += groupID % artifactID % revision

jar tvf utilities-assembly-0.1-SNAPSHOT.jar | grep CheckpointDirectory
com/databricks/apps/twitter_classifier/getCheckpointDirectory.class
getCheckpointDirectory.class

Now I have added this: libraryDependencies += "com.databricks" % "apps.twitter_classifier" However, I am getting an error:

error: No implicit for Append.Value[Seq[sbt.ModuleID], sbt.impl.GroupArtifactID] found,
so sbt.impl.GroupArtifactID cannot be appended to Seq[sbt.ModuleID]
libraryDependencies += "com.databricks" % "apps.twitter_classifier"
^
[error] Type error in expression
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?

Any ideas very appreciated. Thanking you On Sunday, 5 June 2016, 17:39, Ted Yu wrote: For #1, please find examples on the net, e.g. http://www.scala-sbt.org/0.13/docs/Scala-Files-Example.html For #2, import com.databricks.apps.twitter_classifier.getCheckpointDirectory Cheers On Sun, Jun 5, 2016 at 8:36 AM, Ashok Kumar wrote: Thank you sir. At compile time can I do something similar to libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" ? I have these: name := "scala" version := "1.0" scalaVersion := "2.10.4" And if I look at the jar file I have:

jar tvf utilities-assembly-0.1-SNAPSHOT.jar | grep Check
1180 Sun Jun 05 10:14:36 BST 2016 com/databricks/apps/twitter_classifier/getCheckpointDirectory.class
1043 Sun Jun 05 10:14:36 BST 2016 getCheckpointDirectory.class
1216 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask$class.class
615 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask.class

Two questions please: what do I need to put in the libraryDependencies line, and what do I need to add to the top of the Scala app, like import java.io.File, import org.apache.log4j.Logger, import org.apache.log4j.Level, import ...? Thanks On Sunday, 5 June 2016, 15:21, Ted Yu wrote: At compilation time, you need to declare the dependence on getCheckpointDirectory. At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass the jar. Cheers On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar wrote: Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and make references to these classes and methods in my other apps:

class getCheckpointDirectory {
  def CheckpointDirectory(ProgramName: String): String = {
    var hdfsDir = "hdfs://host:9000/user/user/checkpoint/" + ProgramName
    return hdfsDir
  }
}

I have used sbt to create a jar file for it.
It is created as a jar file utilities-assembly-0.1-SNAPSHOT.jar. Now I want to make a call to that method CheckpointDirectory in my app code myapp.scala to return the value for hdfsDir:

val ProgramName = this.getClass.getSimpleName.trim
val getCheckpointDirectory = new getCheckpointDirectory
val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)

However, I am getting a compilation error, as expected:

not found: type getCheckpointDirectory
[error] val getCheckpointDirectory = new getCheckpointDirectory
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

So a basic question: in order for compilation to work, do I need to create a package for my jar file, or add a dependency like the following I do in sbt?

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

Any advice will be appreciated. Thanks
Re: Basic question on using one's own classes in the Scala app
Thank you sir. At compile time can I do something similar to libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" I have these name := "scala" version := "1.0" scalaVersion := "2.10.4" And if I look at jar file i have jar tvf utilities-assembly-0.1-SNAPSHOT.jar|grep Check 1180 Sun Jun 05 10:14:36 BST 2016 com/databricks/apps/twitter_classifier/getCheckpointDirectory.class 1043 Sun Jun 05 10:14:36 BST 2016 getCheckpointDirectory.class 1216 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask$class.class 615 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask.class two questions please What do I need to put in libraryDependencies line and what do I need to add to the top of scala app like import java.io.Fileimport org.apache.log4j.Loggerimport org.apache.log4j.Levelimport ? Thanks On Sunday, 5 June 2016, 15:21, Ted Yu wrote: At compilation time, you need to declare the dependence on getCheckpointDirectory. At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass the jar. Cheers On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar wrote: Hi all, Appreciate any advice on this. It is about scala I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand and make references to these classes and methods in my other apps class getCheckpointDirectory { def CheckpointDirectory (ProgramName: String) : String = { var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName return hdfsDir }}I have used sbt to create a jar file for it. It is created as a jar file utilities-assembly-0.1-SNAPSHOT.jar Now I want to make a call to that method CheckpointDirectory in my app code myapp.dcala to return the value for hdfsDir val ProgramName = this.getClass.getSimpleName.trim val getCheckpointDirectory = new getCheckpointDirectory val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName) However, I am getting a compilation error as expected not found: type getCheckpointDirectory[error] val getCheckpointDirectory = new getCheckpointDirectory[error] ^[error] one error found[error] (compile:compileIncremental) Compilation failed So a basic question, in order for compilation to work do I need to create a package for my jar file or add dependency like the following I do in sbt libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" Any advise will be appreciated. Thanks
Basic question on using one's own classes in the Scala app
Hi all, Appreciate any advice on this. It is about Scala. I have created a very basic Utilities.scala that contains a test class and method. I intend to add my own classes and methods as I expand, and make references to these classes and methods in my other apps:

class getCheckpointDirectory {
  def CheckpointDirectory (ProgramName: String) : String = {
    var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName
    return hdfsDir
  }
}

I have used sbt to create a jar file for it. It is created as a jar file utilities-assembly-0.1-SNAPSHOT.jar. Now I want to make a call to that method CheckpointDirectory in my app code myapp.scala to return the value for hdfsDir:

val ProgramName = this.getClass.getSimpleName.trim
val getCheckpointDirectory = new getCheckpointDirectory
val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)

However, I am getting a compilation error, as expected:

not found: type getCheckpointDirectory
[error] val getCheckpointDirectory = new getCheckpointDirectory
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

So a basic question: in order for compilation to work, do I need to create a package for my jar file, or add a dependency like the following I do in sbt?

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

Any advice will be appreciated. Thanks
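A minimal sketch of what the two files can look like once the helper class carries an explicit package; the package name below is the one visible in the jar listing earlier in the thread, and MyApp is just an illustrative object name:

// Utilities.scala - helper class under an explicit package
package com.databricks.apps.twitter_classifier

class getCheckpointDirectory {
  // builds the HDFS checkpoint directory name for a given program
  def CheckpointDirectory(ProgramName: String): String =
    "hdfs://host:9000/user/user/checkpoint/" + ProgramName
}

// myapp.scala - import the class from that package and call it
import com.databricks.apps.twitter_classifier.getCheckpointDirectory

object MyApp {
  def main(args: Array[String]): Unit = {
    val programName = this.getClass.getSimpleName.trim
    val hdfsDir = new getCheckpointDirectory().CheckpointDirectory(programName)
    println(hdfsDir)
  }
}

At run time the assembly jar still has to be passed with --jars utilities-assembly-0.1-SNAPSHOT.jar, exactly as Ted describes above.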
processing twitter data
Hi all, I know very little about the subject. We would like to get streaming data from Twitter and Facebook, so a few questions please:
- What format is the data from Twitter? Is it JSON format?
- Can I use Spark and Spark Streaming for analysing the data?
- Can the data be fed in/streamed via Kafka from Twitter?
- What would be the optimum batch interval, window length and window sliding interval?
- What is the best method of storing this data in a database? Can I use Hive tables for it, and which format is most suitable please?
Thanking you
Does Spark support updates or deletes on underlying Hive tables
Hi, I can do inserts from Spark into Hive tables. How about updates or deletes? They fail when I try them. Thanking you
Using Java in Spark shell
Hello, A newbie question. Is it possible to use Java code directly in the Spark shell without using Maven to build a jar file? How can I switch from Scala to Java in the Spark shell? Thanks
Re: Using Spark on Hive with Hive also using Spark as its execution engine
Hi Dr Mich, This is very good news. I will be interested to know how Hive engages with Spark as an engine. What Spark processes are used to make this work? Thanking you

On Monday, 23 May 2016, 19:01, Mich Talebzadeh wrote:
Have a look at this thread
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 23 May 2016 at 09:10, Mich Talebzadeh wrote:
Hi Timur and everyone. I will answer your first question as it is very relevant.

1) How to make 2 versions of Spark live together on the same cluster (libraries clash, paths, etc.)? Most Spark users perform ETL and ML operations on Spark as well, so we may have 3 Spark installations simultaneously.

There are two distinct points here.

Using Spark as a query engine. That is BAU and most forum members use it every day. You run Spark with either Standalone, Yarn or Mesos as the cluster manager. You start the master that does the management of resources and you start slaves to create workers. You deploy Spark either by spark-shell, spark-sql or submit jobs through spark-submit etc. You may or may not use Hive as your database. You may use HBase via Phoenix etc. If you choose to use Hive as your database, on every host of the cluster including your master host you ensure that the Hive APIs are installed (meaning Hive is installed). In $SPARK_HOME/conf you create a soft link to the Hive configuration:

cd $SPARK_HOME/conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
lrwxrwxrwx 1 hduser hadoop 32 May 3 17:48 hive-site.xml -> /usr/lib/hive/conf/hive-site.xml

Now in hive-site.xml you can define all the parameters needed for Spark connectivity. Remember we are making Hive use the Spark 1.3.1 engine. WE ARE NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start a master or workers for Spark 1.3.1! It is just an execution engine like mr etc.

Let us look at how we do that in hive-site.xml, noting the settings for hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2 below. That tells Hive to use Spark 1.3.1 as the execution engine. You just install Spark 1.3.1 on the host (just the binary download); it is /usr/lib/spark-1.3.1-bin-hadoop2.6. In hive-site.xml you set the properties:

hive.execution.engine = spark
  Expects one of [mr, tez, spark]. Chooses the execution engine. Options are: mr (MapReduce, default), tez, spark. While MR remains the default engine for historical reasons, it is itself a historical engine and is deprecated in the Hive 2 line. It may be removed without further warning.

spark.home = /usr/lib/spark-1.3.1-bin-hadoop2
  something

hive.merge.sparkfiles = false
  Merge small files at the end of a Spark DAG transformation.

hive.spark.client.future.timeout = 60s
  Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified. Timeout for requests from the Hive client to the remote Spark driver.

hive.spark.job.monitor.timeout = 60s
  Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified. Timeout for the job monitor to get the Spark job state.

hive.spark.client.connect.timeout = 1000ms
  Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified. Timeout for the remote Spark driver in connecting back to the Hive client.

hive.spark.client.server.connect.timeout = 9ms
  Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
  Timeout for the handshake between the Hive client and the remote Spark driver. Checked by both processes.

hive.spark.client.secret.bits = 256
  Number of bits of randomness in the generated secret for communication between the Hive client and the remote Spark driver. Rounded down to the nearest multiple of 8.

hive.spark.client.rpc.threads = 8
  Maximum number of threads for the remote Spark driver's RPC event loop.

And other settings as well. That was the Hive stuff for your Spark BAU, so there are two distinct things.

Now going to Hive itself, you will need to add the correct assembly jar file for Hadoop. These are called spark-assembly-x.y.z-hadoop2.4.0.jar, where x.y.z in this case is 1.3.1. The assembly file is spark-assembly-1.3.1-hadoop2.4.0.jar, so you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib:

ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
/usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar

And you need to compile Spark from source excluding the Hadoop dependencies:

./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

So Hive uses the Spark engine by default. If you want to use mr
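For reference, the property list above is flattened hive-site.xml content; rewrapped in the standard Hive XML syntax, the two settings that actually switch the engine look roughly like this (values copied from Mich's description, nothing else assumed):

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.home</name>
  <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
</property>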
Monitoring Spark application progress
Hi, I would like to know the approaches and tools, please, for monitoring the full performance of a Spark app run through spark-shell or spark-submit:
- Through the Spark GUI at port 4040?
- Through OS utilities such as top and SAR
- Through Java tools like jbuilder etc
- Through integrating Spark with monitoring tools
Thanks
Spark handling spill overs
Hi, How can one avoid Spark spilling to disk once it has filled the node's memory? Thanks
Re: My notes on Spark Performance & Tuning Guide
Hi Dr Mich, I will be very keen to have a look at it and review it if possible. Please forward me a copy. Thanking you warmly

On Thursday, 12 May 2016, 11:08, Mich Talebzadeh wrote:
Hi All, Following the threads in the Spark forum, I decided to write up on the configuration of Spark, including the allocation of resources, the configuration of the driver, executors and threads, the execution of Spark apps and general troubleshooting, taking into account the allocation of resources for Spark applications and the OS tools at our disposal. Since the most widespread configuration, as I notice, is "Spark Standalone Mode", I have decided to write these notes starting with Standalone and later on moving to Yarn:
- Standalone - a simple cluster manager included with Spark that makes it easy to set up a cluster.
- YARN - the resource manager in Hadoop 2.
I would appreciate it if anyone interested in reading and commenting would get in touch with me directly on mich.talebza...@gmail.com so I can send the write-up for their review and comments. Just to be clear, this is not meant to be any commercial proposition or anything like that. As I seem to get involved with members troubleshooting issues and threads on this topic, I thought it worthwhile writing a note about it to summarise the findings for the benefit of the community. Regards.
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
Re: Re: How big the spark stream window could be ?
Great. So in the simplest of forms, let us assume that I have a standalone host that runs Spark and receives topics from a source, say Kafka. So basically I have one executor and one cache on the node, and if my streaming data is too much, I anticipate there will be no execution as I don't have the memory. On the other hand, if I have enough memory allocated then the application will start. Every batch interval, data is refreshed from the topic on Kafka, so depending on the size of my windowLength that data will persist in memory for the windowLength duration and I will be able to analyze the aggregated data through the slidingInterval. However, if my windowLength is too big, as in this case of 24 hours, the host may not have enough memory to hold that amount of data for 24 hours? So the only option would be to reduce the size of the windowLength or reduce the volume of the topic?

On Monday, 9 May 2016, 10:49, Saisai Shao wrote:
Please see the inline comments.

On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar wrote:
Thank you. So if I create Spark streaming then:
- The streams will always need to be cached? They cannot be stored in persistent storage
You don't need to cache the stream explicitly if you don't have a specific requirement; Spark will do it for you depending on the streaming source (Kafka or socket).
- The stream data cached will be distributed among all nodes of Spark among executors
- As I understand, each Spark worker node has one executor that includes a cache, so the streaming data is distributed among these worker node caches. For example, if I have 4 worker nodes each cache will have a quarter of the data (this assumes that the cache size among worker nodes is the same).
Ideally it will be distributed evenly across the executors; this is also a target for tuning. Normally it depends on several conditions like receiver distribution and partition distribution.
The issue arises if the amount of streaming data does not fit into these 4 caches? Will the job crash?

On Monday, 9 May 2016, 10:16, Saisai Shao wrote:
No, each executor only stores part of the data in memory (it depends on how the partitions are distributed and how many receivers you have). For WindowedDStream, it will obviously cache the data in memory; from my understanding you don't need to call cache() again.

On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar wrote:
Hi, so if I have 10 GB of streaming data coming in, does it require 10 GB of memory in each node? Also, in that case, why do we need to use dstream.cache()? Thanks

On Monday, 9 May 2016, 9:58, Saisai Shao wrote:
It is up to how you write the Spark application; normally if the data is already on persistent storage, there's no need to put it into memory. The reason why Spark Streaming data has to be stored in memory is that a streaming source is not a persistent source, so you need to have a place to store the data.

On Mon, May 9, 2016 at 4:43 PM, 李明伟 wrote:
Thanks. What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24 hour data set is 100 GB, do I also need 100 GB of RAM to do the one-time batch calculation?

At 2016-05-09 15:14:47, "Saisai Shao" wrote:
For window related operators, Spark Streaming will cache the data into memory within this window. In your case your window size is up to 24 hours, which means data has to be in the executor's memory for more than 1 day; this may introduce several problems when memory is not enough.
On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh wrote:
OK, terms for Spark Streaming.

"Batch interval" is the basic interval at which the system will receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval as 300 seconds, then any input DStream will generate RDDs of received data at 300 second intervals.

A window operator is defined by two parameters:
- WindowDuration / WindowsLength - the length of the window
- SlideDuration / SlidingInterval - the interval at which the window will slide or move forward

OK, so your batch interval is 5 minutes. That is the rate messages are coming in from the source. Then you have these two params:

// window length - the duration of the window; must be a multiple of the batch interval n in StreamingContext(sparkConf, Seconds(n))
val windowLength = x = m * n
// sliding interval - the interval at which the window operation is performed, in other words data is collected within this "previous interval"
val slidingInterval = y, where x/y = even number

Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval".

If you want to collect 1 hour of data then windowLength = 12 * 5 * 60 seconds
If you want to collect 24 hours of data then windowLength = 24 * 12 * 5 * 60

You sliding wind
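To make the arithmetic above concrete, a minimal sketch of a windowed count over a DStream, using the figures from this thread (batch interval 5 minutes, window 1 hour, slide 5 minutes); the socket source on localhost:9999 is only a placeholder for whatever the real source (e.g. Kafka) is:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("WindowExample")
val ssc = new StreamingContext(sparkConf, Seconds(5 * 60))   // batch interval = 5 minutes

val lines = ssc.socketTextStream("localhost", 9999)          // placeholder source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,   // combine counts entering the window
    Seconds(12 * 5 * 60),          // window length    = 1 hour
    Seconds(5 * 60))               // sliding interval = 5 minutes

counts.print()
ssc.start()
ssc.awaitTermination()

The 24-hour report is the same code with Seconds(24 * 12 * 5 * 60) as the window length, which is exactly the case where, as Saisai notes, a whole day of data has to stay in executor memory.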
Re: Re: How big the spark stream window could be ?
Thank you. So If I create spark streaming then - The streams will always need to be cached? It cannot be stored in persistent storage - The stream data cached will be distributed among all nodes of Spark among executors - As I understand each Spark worker node has one executor that includes cache. So the streaming data is distributed among these work node caches. For example if I have 4 worker nodes each cache will have a quarter of data (this assumes that cache size among worker nodes is the same.) The issue raises if the amount of streaming data does not fit into these 4 caches? Will the job crash? On Monday, 9 May 2016, 10:16, Saisai Shao wrote: No, each executor only stores part of data in memory (it depends on how the partition are distributed and how many receivers you have). For WindowedDStream, it will obviously cache the data in memory, from my understanding you don't need to call cache() again. On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar wrote: hi, so if i have 10gb of streaming data coming in does it require 10gb of memory in each node? also in that case why do we need using dstream.cache() thanks On Monday, 9 May 2016, 9:58, Saisai Shao wrote: It depends on you to write the Spark application, normally if data is already on the persistent storage, there's no need to be put into memory. The reason why Spark Streaming has to be stored in memory is that streaming source is not persistent source, so you need to have a place to store the data. On Mon, May 9, 2016 at 4:43 PM, 李明伟 wrote: Thanks.What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24 hour data set is 100 GB. Do I also need a 100GB RAM to do the one time batch calculation ? At 2016-05-09 15:14:47, "Saisai Shao" wrote: For window related operators, Spark Streaming will cache the data into memory within this window, in your case your window size is up to 24 hours, which means data has to be in Executor's memory for more than 1 day, this may introduce several problems when memory is not enough. On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh wrote: ok terms for Spark Streaming "Batch interval" is the basic interval at which the system with receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval as 300 seconds, then any input DStream will generate RDDs of received data at 300 seconds intervals.A window operator is defined by two parameters -- WindowDuration / WindowsLength - the length of the window- SlideDuration / SlidingInterval - the interval at which the window will slide or move forward Ok so your batch interval is 5 minutes. That is the rate messages are coming in from the source. Then you have these two params // window length - The duration of the window below that must be multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) val windowLength = x = m * n // sliding interval - The interval at which the window operation is performed in other words data is collected within this "previous interval' val slidingInterval = y l x/y = even number Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval". If you want to collect 1 hour data then windowLength = 12 * 5 * 60 seconds If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * 60 You sliding window should be set to batch interval = 5 * 60 seconds. 
In other words, that is where the aggregates and summaries come from for your report. What is your data source here? HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 9 May 2016 at 04:19, kramer2...@126.com wrote:
We have some stream data that needs to be calculated and are considering using Spark Streaming to do it. We need to generate three kinds of reports. The reports are based on:
1. The last 5 minutes of data
2. The last 1 hour of data
3. The last 24 hours of data
The frequency of the reports is 5 minutes. After reading the docs, the most obvious way to solve this seems to be to set up a Spark stream with a 5 minute interval and two windows of 1 hour and 1 day. But I am worried that a window of one day or one hour is too big. I do not have much experience with Spark Streaming, so what is the window length in your environment? Any official docs talking about this?
Re: Re: How big the spark stream window could be ?
hi, so if i have 10gb of streaming data coming in does it require 10gb of memory in each node? also in that case why do we need using dstream.cache() thanks On Monday, 9 May 2016, 9:58, Saisai Shao wrote: It depends on you to write the Spark application, normally if data is already on the persistent storage, there's no need to be put into memory. The reason why Spark Streaming has to be stored in memory is that streaming source is not persistent source, so you need to have a place to store the data. On Mon, May 9, 2016 at 4:43 PM, 李明伟 wrote: Thanks.What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24 hour data set is 100 GB. Do I also need a 100GB RAM to do the one time batch calculation ? At 2016-05-09 15:14:47, "Saisai Shao" wrote: For window related operators, Spark Streaming will cache the data into memory within this window, in your case your window size is up to 24 hours, which means data has to be in Executor's memory for more than 1 day, this may introduce several problems when memory is not enough. On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh wrote: ok terms for Spark Streaming "Batch interval" is the basic interval at which the system with receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval as 300 seconds, then any input DStream will generate RDDs of received data at 300 seconds intervals.A window operator is defined by two parameters -- WindowDuration / WindowsLength - the length of the window- SlideDuration / SlidingInterval - the interval at which the window will slide or move forward Ok so your batch interval is 5 minutes. That is the rate messages are coming in from the source. Then you have these two params // window length - The duration of the window below that must be multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) val windowLength = x = m * n // sliding interval - The interval at which the window operation is performed in other words data is collected within this "previous interval' val slidingInterval = y l x/y = even number Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval". If you want to collect 1 hour data then windowLength = 12 * 5 * 60 seconds If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * 60 You sliding window should be set to batch interval = 5 * 60 seconds. In other words that where the aggregates and summaries come for your report. What is your data source here? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 9 May 2016 at 04:19, kramer2...@126.com wrote: We have some stream data need to be calculated and considering use spark stream to do it. We need to generate three kinds of reports. The reports are based on 1. The last 5 minutes data 2. The last 1 hour data 3. The last 24 hour data The frequency of reports is 5 minutes. After reading the docs, the most obvious way to solve this seems to set up a spark stream with 5 minutes interval and two window which are 1 hour and 1 day. But I am worrying that if the window is too big for one day and one hour. I do not have much experience on spark stream, so what is the window length in your environment? Any official docs talking about this? 
Re: Defining case class within main method throws "No TypeTag available for Accounts"
Thanks Michael; as I gathered, for now it is a feature.

On Monday, 25 April 2016, 18:36, Michael Armbrust wrote:
When you define a class inside of a method, it implicitly has a pointer to the outer scope of the method. Spark doesn't have access to this scope, so this makes it hard (impossible?) for us to construct new instances of that class. So, define the classes that you plan to use with Spark at the top level.

On Mon, Apr 25, 2016 at 9:36 AM, Mich Talebzadeh wrote:
Hi, I notice that building with sbt, if I define my case class outside of the main method like below, it works:

case class Accounts( TransactionDate: String, TransactionType: String, Description: String, Value: Double, Balance: Double, AccountName: String, AccountNumber : String)

object Import_nw_10124772 {
  def main(args: Array[String]) {
    val conf = new SparkConf().
      setAppName("Import_nw_10124772").
      setMaster("local[12]").
      set("spark.driver.allowMultipleContexts", "true").
      set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    // Create sqlContext based on HiveContext
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._
    val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    println ("\nStarted at")
    sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') ").collect.foreach(println)
    //
    // Get a DF first based on Databricks CSV libraries, ignore column heading because of column called "Type"
    //
    val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
    //df.printSchema
    //
    val a = df.filter(col("Date") > "").map(p => Accounts(p(0).toString,p(1).toString,p(2).toString,p(3).toString.toDouble,p(4).toString.toDouble,p(5).toString,p(6).toString))

However, if I put that case class within the main method, it throws "No TypeTag available for Accounts". Apparently when the case class is defined inside the method where it is being used, it is not fully defined at that point. Is this a bug within Spark? Thanks
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
Invoking SparkR from Spark shell
Hi, I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R with Spark. Is there a shell similar to spark-shell that supports R besides Scala please? Thanks
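A quick sketch of the usual entry point, assuming a standard Spark 1.6.1 binary install with R available on the machine (the data-frame line is only an illustration using R's built-in faithful dataset):

$SPARK_HOME/bin/sparkR        # launches an R shell with SparkR loaded; sc and sqlContext are pre-created

> df <- createDataFrame(sqlContext, faithful)
> head(df)

SparkR scripts can also be run non-interactively, e.g. $SPARK_HOME/bin/spark-submit my_script.R, where my_script.R is whatever your R file is called.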
Re: Spark replacing Hadoop
Hello, Well, it sounds like Andy is implying that Spark can replace Hadoop, whereas Mich still believes that HDFS is a keeper? Thanks

On Thursday, 14 April 2016, 20:40, David Newberger wrote:
Can we assume your question is "Will Spark replace Hadoop MapReduce?" or do you literally mean replacing the whole of Hadoop?
David

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: Thursday, April 14, 2016 2:13 PM
To: User
Subject: Spark replacing Hadoop

Hi, I hear some saying that Hadoop is getting old and out of date and will be replaced by Spark! Does this make sense and if so how accurate is it? Best
Spark replacing Hadoop
Hi, I hear some saying that Hadoop is getting old and out of date and will be replaced by Spark! Does this make sense and if so how accurate is it? Best
Spark GUI, Workers and Executors
On the Spark GUI I can see the list of Workers. I always understood that executors run on workers. What is the relationship between workers and executors please? Is it one to one? Thanks
Copying all Hive tables from Prod to UAT
Hi, Does anyone have suggestions on how to create and copy Hive and Spark tables from Production to UAT? One way would be to copy the table data to external files, then move the external files to a local target directory and populate the tables in the target Hive with that data. Is there an easier way of doing so? Thanks
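One possible shortcut, sketched here on the assumption that the two clusters can exchange files over HDFS (e.g. via distcp) and that the database/table names below are placeholders, is Hive's built-in EXPORT/IMPORT, which carries both the data and the table metadata:

-- on Production
EXPORT TABLE mydb.mytable TO '/tmp/exports/mytable';
-- copy /tmp/exports/mytable across clusters, e.g. with hadoop distcp
-- on UAT
IMPORT TABLE mydb.mytable FROM '/tmp/exports/mytable';

Whether this fits depends on the table formats and Hive versions on the two clusters, so treat it as one option to evaluate rather than a recommendation.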
difference between simple streaming and windows streaming in spark
Does simple streaming mean continuous processing of each batch as it arrives, and windowed streaming mean processing over a time window?

val ssc = new StreamingContext(sparkConf, Seconds(10))

Thanks
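A small sketch of the distinction, reusing the 10-second context above (the socket source is just a placeholder): without a window each operation sees only the latest 10-second batch, while window() lets an operation see the last several batches at once.

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source

// "simple" streaming: counts one 10-second batch at a time
lines.count().print()

// windowed streaming: every 10 seconds, counts the last 60 seconds of data
lines.window(Seconds(60), Seconds(10)).count().print()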
Working out SQRT on a list
Hi, I would like a simple sqrt operation on a list but I don't get the result:

scala> val l = List(1,5,786,25)
l: List[Int] = List(1, 5, 786, 25)

scala> l.map(x => x * x)
res42: List[Int] = List(1, 25, 617796, 625)

scala> l.map(x => x * x).sqrt
:28: error: value sqrt is not a member of List[Int]
       l.map(x => x * x).sqrt

Any ideas? Thanks
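The error is because sqrt is not a method on List; it lives in scala.math and has to be applied element by element inside the map. A minimal sketch:

val l = List(1, 5, 786, 25)

// square each element, then take the square root of each squared value
val roots: List[Double] = l.map(x => math.sqrt((x * x).toDouble))
// roots: List[Double] = List(1.0, 5.0, 786.0, 25.0)

// or, if the intent was simply the square root of the original numbers:
val simpleRoots = l.map(x => math.sqrt(x.toDouble))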
Spark process creating and writing to a Hive ORC table
Hello, How feasible is it to use Spark to extract csv files and create and write the content to an ORC table in a Hive database? Also, is Parquet the best (optimum) format to write to HDFS from a Spark app? Thanks
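For the first part, a minimal sketch of the csv-to-ORC path on Spark 1.x with the databricks csv reader discussed elsewhere in these threads (the HDFS path, database and table names are placeholders):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://host:9000/data/stg/accounts/mycsvdir")   // placeholder input path

// write the DataFrame out as an ORC-backed Hive table
df.write.format("orc").saveAsTable("mydb.my_orc_table")  // placeholder db.table

Whether ORC or Parquet is "best" mostly depends on what reads the data afterwards; both are columnar and compressed, and Hive-heavy shops often lean towards ORC while Spark-only pipelines often default to Parquet.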
Re: Spark and N-tier architecture
Thank you both. So am I correct that Spark fits within the application tier in an N-tier architecture?

On Tuesday, 29 March 2016, 23:50, Alexander Pivovarov wrote:
Spark is a distributed data processing engine plus a distributed in-memory / disk data cache. spark-jobserver provides a REST API to your Spark applications. It allows you to submit jobs to Spark and get results in sync or async mode. It also can create a long-running Spark context to cache RDDs in memory under some name (namedRDD) and then use it to serve requests from multiple users. Because the RDD is in memory, the response should be super fast (seconds). https://github.com/spark-jobserver/spark-jobserver

On Tue, Mar 29, 2016 at 2:50 PM, Mich Talebzadeh wrote:
Interesting question. The most widely used application of N-tier is the traditional three-tier architecture that has been the backbone of client-server architecture, having a presentation layer, an application layer and a data layer. This is primarily for performance, scalability and maintenance. The most profound change that the Big Data space has introduced to N-tier architecture is the concept of horizontal scaling, as opposed to the previous tiers that relied on vertical scaling. HDFS is an example of horizontal scaling at the data tier by adding more JBODs to storage. Similarly, adding more nodes to a Spark cluster should result in better performance. Bear in mind that these tiers are at logical levels, which means that there may or may not be as many physical layers; for example, multiple virtual servers can be hosted on the same physical server. With regard to Spark, it is effectively a powerful query tool that sits between the presentation layer (say Tableau) and HDFS or Hive, as you alluded. In that sense you can think of Spark as part of the application layer that communicates with the backend via a number of protocols including standard JDBC. There is a rather blurred vision here of whether Spark is a database or a query tool. IMO it is a query tool, in the sense that Spark by itself does not have its own storage concept or metastore; thus it relies on others to provide that service. HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 29 March 2016 at 22:07, Ashok Kumar wrote:
Experts, One of the terms I hear used within Big Data is N-tier architecture, used for availability, performance etc. I also hear that Spark, by means of its query engine and in-memory caching, fits into the middle tier (application layer), with HDFS and Hive perhaps providing the data tier. Can someone elaborate on the role of Spark here? For example, a Scala program that we write uses JDBC to talk to databases, so in that sense is Spark a middle tier application? I hope that someone can clarify this, and if so what would be the best practice in using Spark as a middle tier within Big Data. Thanks
Spark and N-tier architecture
Experts, One of the terms I hear used within Big Data is N-tier architecture, used for availability, performance etc. I also hear that Spark, by means of its query engine and in-memory caching, fits into the middle tier (application layer), with HDFS and Hive perhaps providing the data tier. Can someone elaborate on the role of Spark here? For example, a Scala program that we write uses JDBC to talk to databases, so in that sense is Spark a middle tier application? I hope that someone can clarify this, and if so what would be the best practice in using Spark as a middle tier within Big Data. Thanks
Re: Databricks fails to read the csv file with blank line at the file header
Thanks a ton sir. Very helpful.

On Monday, 28 March 2016, 22:36, Mich Talebzadeh wrote:
Pretty straightforward:

#!/bin/ksh
DIR="hdfs://:9000/data/stg/accounts/nw/x"
#
## Remove the blank header line from the spreadsheets and compress them
#
echo `date` " ""=== Started Removing blank header line and Compressing all csv FILEs"
for FILE in `ls *.csv`
do
  sed '1d' ${FILE} > ${FILE}.tmp
  mv -f ${FILE}.tmp ${FILE}
  /usr/bin/bzip2 ${FILE}
done
#
## Clear out hdfs staging directory
#
echo `date` " ""=== Started deleting old files from hdfs staging directory ${DIR}"
hdfs dfs -rm -r ${DIR}/*.bz2
echo `date` " ""=== Started Putting bz2 fileS to hdfs staging directory ${DIR}"
for FILE in `ls *.bz2`
do
  hdfs dfs -copyFromLocal ${FILE} ${DIR}
done
echo `date` " ""=== Checking that all files are moved to hdfs staging directory"
hdfs dfs -ls ${DIR}
exit 0

HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 28 March 2016 at 22:24, Ashok Kumar wrote:
Hello Mich, If you can accommodate it, can you please share your approach to steps 1-3 above. Best regards

On Sunday, 27 March 2016, 14:53, Mich Talebzadeh wrote:
Pretty simple; as usual it is a combination of ETL and ELT. Basically csv files are loaded into a staging directory on the host and compressed before pushing into hdfs:
- ETL --> Get rid of the blank header line on the csv files
- ETL --> Compress the csv files
- ETL --> Put the compressed CSV files into the hdfs staging directory
- ELT --> Use databricks to load the csv files
- ELT --> Spark FP to process the csv data
- ELT --> register it as a temporary table
- ELT --> Create an ORC table in a named database in compressed zlib2 format in the Hive database
- ELT --> Insert/select from the temporary table to the Hive table
So the data is stored in an ORC table and one can do whatever analysis is needed using Spark, Hive etc.
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 27 March 2016 at 03:05, Koert Kuipers wrote:
To me this is expected behavior that I would not want fixed, but if you look at the recent commits for spark-csv it has one that deals with this...

On Mar 26, 2016 21:25, "Mich Talebzadeh" wrote:
Hi, I have a standard csv file (saved as csv in HDFS) that has a blank first line at the header, as follows:

[blank line]
Date, Type, Description, Value, Balance, Account Name, Account Number
[blank line]
22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN AE","'638585-60125663",

When I read this file using the following standard

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")

it crashes:

java.util.NoSuchElementException at java.util.ArrayList$Itr.next(ArrayList.java:794)

If I go and manually delete the first blank line it works OK:

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
df: org.apache.spark.sql.DataFrame = [Date: string, Type: string, Description: string, Value: double, Balance: double, Account Name: string, Account Number: string]

I can easily write a shell script to get rid of the blank line. I was wondering if databricks has a flag to get rid of the first blank line in the csv file format?

P.S. If the file is stored as a DOS text file, this problem goes away.
Thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Re: Databricks fails to read the csv file with blank line at the file header
Hello Mich If you accommodate can you please share your approach to steps 1-3 above. Best regards On Sunday, 27 March 2016, 14:53, Mich Talebzadeh wrote: Pretty simple as usual it is a combination of ETL and ELT. Basically csv files are loaded into staging directory on host, compressed before pushing into hdfs - ETL --> Get rid of the header blank line on the csv files - ETL --> Compress the csv files - ETL --> Put the compressed CVF files into hdfs staging directory - ELT --> Use databricks to load the csv files - ELT --> Spark FP to prcess the csv data - ELT --> register it as a temporary table - ELT --> Create an ORC table in a named database in compressed zlib2 format in Hive database - ELT --> Insert/select from temporary table to Hive table So the data is stored in an ORC table and one can do whatever analysis using Spark, Hive etc Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 27 March 2016 at 03:05, Koert Kuipers wrote: To me this is expected behavior that I would not want fixed, but if you look at the recent commits for spark-csv it has one that deals this...On Mar 26, 2016 21:25, "Mich Talebzadeh" wrote: Hi, I have a standard csv file (saved as csv in HDFS) that has first line of blank at the headeras follows [blank line] Date, Type, Description, Value, Balance, Account Name, Account Number[blank line]22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN AE","'638585-60125663", When I read this file using the following standard val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/") it crashes. java.util.NoSuchElementException at java.util.ArrayList$Itr.next(ArrayList.java:794) If I go and manually delete the first blank line it works OK val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/") df: org.apache.spark.sql.DataFrame = [Date: string, Type: string, Description: string, Value: double, Balance: double, Account Name: string, Account Number: string] I can easily write a shell script to get rid of blank line. I was wondering if databricks does have a flag to get rid of the first blank line in csv file format? P.S. If the file is stored as DOS text file, this problem goes away. Thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
Re: Finding out the time a table was created
1.5.2 Ted. Those two lines I don't know where they come. It finds and gets the table info OK HTH On Friday, 25 March 2016, 22:32, Ted Yu wrote: Which release of Spark do you use, Mich ? In master branch, the message is more accurate (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala): override def getMessage: String = s"Table $table not found in database $db" On Fri, Mar 25, 2016 at 3:21 PM, Mich Talebzadeh wrote: You can use DESCRIBE FORMATTED . to get that info. This is based on the same command in Hive however, it throws two erroneous error lines as shown below (don't see them in Hive DESCRIBE ...) Example scala> sql("describe formatted test.t14").collect.foreach(println) 16/03/25 22:32:38 ERROR Hive: Table test not found: test.test table not found 16/03/25 22:32:38 ERROR Hive: Table test not found: test.test table not found [# col_name data_type comment ] [ ] [invoicenumber int ] [paymentdate date ] [net decimal(20,2) ] [vat decimal(20,2) ] [total decimal(20,2) ] [ ] [# Detailed Table Information ] [Database: test ] [Owner: hduser ] [CreateTime: Fri Mar 25 22:13:44 GMT 2016 ] [LastAccessTime: UNKNOWN ] [Protect Mode: None ] [Retention: 0 ] [Location: hdfs://rhes564:9000/user/hive/warehouse/test.db/t14 ] [Table Type: MANAGED_TABLE ] [Table Parameters: ] [ COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}] [ comment from csv file from excel sheet] [ numFiles 2 ] [ orc.compress ZLIB ] [ totalSize 1090 ] [ transient_lastDdlTime 1458944025 ] [ ] [# Storage Information ] [SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde ] [InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat ] [OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat ] [Compressed: No ] [Num Buckets: -1 ] [Bucket Columns: [] ] [Sort Columns: [] ] [Storage Desc Params: ] [ serialization.format 1 ] HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 25 March 2016 at 22:12, Ashok Kumar wrote: Experts, I would like to know when a table was created in Hive database using Spark shell? Thanks
Finding out the time a table was created
Experts, I would like to know when a table was created in Hive database using Spark shell? Thanks
Re: calling individual columns from spark temporary table
Thank you again. For

val r = df.filter(col("paid") > "").map(x => (x.getString(0),x.getString(1).)

can you give an example of a column expression please, like df.filter(col("paid") > "").col("firstcolumn").getString?

On Thursday, 24 March 2016, 0:45, Michael Armbrust wrote:
You can only use as on a Column expression, not inside of a lambda function. The reason is the lambda function is compiled into opaque bytecode that Spark SQL is not able to see; we just blindly execute it. However, there are a couple of ways to name the columns that come out of a map: either use a case class instead of a tuple, or use .toDF("name1", "name2") after the map. From a performance perspective, it is even better if you can avoid maps and stick to Column expressions. The reason is that for maps we have to actually materialize an object to pass to your function, whereas if you stick to Column expressions we can work directly on serialized data.

On Wed, Mar 23, 2016 at 5:27 PM, Ashok Kumar wrote:
Thank you sir

sql("select `_1` as firstcolumn from items")

Is there any way one can keep the csv column names using databricks when mapping

val r = df.filter(col("paid") > "").map(x => (x.getString(0),x.getString(1).)

Can I call, for example, x.getString(0).as(firstcolumn) in the above when mapping, if possible, so the columns will have labels?

On Thursday, 24 March 2016, 0:18, Michael Armbrust wrote:
You probably need to use `backticks` to escape `_1` since I don't think that it is a valid SQL identifier.

On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar wrote:
Gurus, If I register a temporary table as below

r.toDF
res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double]

r.toDF.registerTempTable("items")

sql("select * from items")
res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double]

Is there any way I can do a select on the first column only?

sql("select _1 from items") throws an error

Thanking you
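Pulling Michael's two suggestions together, a minimal sketch of naming the columns that come out of a map (the column names and the Payments case class are illustrative, not from the original csv):

// assumes a DataFrame df with a string column "paid", as in the thread
import org.apache.spark.sql.functions.col
import sqlContext.implicits._   // brings in toDF for the mapped RDD

// Option 1: name the tuple columns right after the map
val named = df.filter(col("paid") > "")
  .map(x => (x.getString(0), x.getString(1)))
  .toDF("firstcolumn", "secondcolumn")

// Option 2: map into a case class defined at the top level; its field names become the column names
case class Payments(firstcolumn: String, secondcolumn: String)
val typed = df.filter(col("paid") > "")
  .map(x => Payments(x.getString(0), x.getString(1)))
  .toDF()

named.registerTempTable("items")
sqlContext.sql("select firstcolumn from items").show()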
Re: calling individual columns from spark temporary table
thank you sir sql("select `_1` as firstcolumn from items") is there anyway one can keep the csv column names using databricks when mapping val r = df.filter(col("paid") > "").map(x => (x.getString(0),x.getString(1).) can I call example x.getString(0).as.(firstcolumn) in above when mapping if possible so columns will have labels On Thursday, 24 March 2016, 0:18, Michael Armbrust wrote: You probably need to use `backticks` to escape `_1` since I don't think that its a valid SQL identifier. On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar wrote: Gurus, If I register a temporary table as below r.toDFres58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double] r.toDF.registerTempTable("items") sql("select * from items")res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double] Is there anyway I can do a select on the first column only sql("select _1 from items" throws error Thanking you
calling individual columns from spark temporary table
Gurus, If I register a temporary table as below

r.toDF
res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double]

r.toDF.registerTempTable("items")

sql("select * from items")
res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3: double, _4: double, _5: double]

Is there any way I can do a select on the first column only?

sql("select _1 from items") throws an error

Thanking you
reading csv file, operation on column or columns
Gurus, I would like to read a csv file into a DataFrame, but be able to rename a column, change a column type from String to Integer, or drop a column from further analysis, before saving the data as a parquet file. How can I do this? Thanks
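A minimal sketch of that pipeline on Spark 1.x with the databricks csv package (the file path, column names and output path are all placeholders):

import org.apache.spark.sql.functions.col

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs://host:9000/data/stg/mycsvdir")           // placeholder input

val cleaned = df
  .withColumnRenamed("old_name", "new_name")             // rename a column
  .withColumn("amount", col("amount").cast("int"))       // change a column type from String to Integer
  .drop("unwanted_column")                                // drop a column from further analysis

cleaned.write.parquet("hdfs://host:9000/data/out/myparquet")   // placeholder output

withColumnRenamed, withColumn plus cast, and drop all exist on DataFrame in Spark 1.5+; col comes from org.apache.spark.sql.functions.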
Setting up spark to run on two nodes
Experts, please, your valued advice. I have Spark 1.5.2 set up as standalone for now and I have started the master as below:

start-master.sh

I have also modified the conf/slaves file to have:

# A Spark Worker will be started on each of the machines listed below.
localhost
workerhost

On the localhost I start a slave as follows:

start-slave.sh spark://localhost:7077

Questions. If I want a worker process to be started not only on localhost but also on workerhost:
1) Do I just need to run start-slave.sh on localhost and it will start the worker process on the other node -> workerhost?
2) Or do I have to run start-slave.sh spark://workerhost:7077 locally on workerhost as well?
3) On the GUI at http://localhost:4040/environment/ I do not see any reference to a worker process running on workerhost.

Appreciate any help on how to go about starting the master on localhost and starting two workers, one on localhost and the other on workerhost. Thanking you
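A short sketch of the usual standalone recipe for this layout (hostnames as in the question; assumes passwordless ssh from localhost to workerhost and the same Spark install path on both):

# $SPARK_HOME/conf/slaves on the master host - one worker host per line
localhost
workerhost

# start the master plus one worker on every host listed in conf/slaves
$SPARK_HOME/sbin/start-all.sh

# alternatively, start a worker by hand on each host:
$SPARK_HOME/sbin/start-slave.sh spark://localhost:7077

Registered workers show up on the master web UI at http://localhost:8080, not on the per-application UI at port 4040, which is likely why nothing appears under /environment.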
shuffle in spark
Experts, please, I need to understand how shuffling works in Spark and which parameters influence it. I am sorry, but my knowledge of shuffling is very limited. A practical use case would help if you can provide one. Regards
Spark configuration with 5 nodes
Hi, We intend to use 5 servers which will be utilized for building a Big Data Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or others). All servers have the same configuration: 512 GB RAM, 30 TB storage and 16 cores, running Ubuntu Linux. Hadoop will be installed on all the servers/nodes. Server 1 will be used for the NameNode plus a DataNode as well. Server 2 will be used for the standby NameNode and a DataNode. The rest of the servers will be used as DataNodes. Now we would like to install Spark on each server to create a Spark cluster. Is that a good thing to do, or should we buy additional hardware for Spark (minding cost here), or do we simply require additional memory to accommodate Spark as well please? In that case, how much memory for each Spark node would you recommend? Thanks all
HBASE
Hi Gurus, I am relatively new to Big Data and know something about Spark and Hive. I was wondering whether I need to pick up skills on HBase as well. I am not sure how it works, but I know that it is a kind of columnar NoSQL database. I know it is good to learn something new in the Big Data space; I am just wondering if I am better off spending my effort on something else please. Appreciate any advice. Regards
Re: Converting array to DF
Thanks great val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toSeq.toDF("weights","value").orderBy(desc("value")).collect.foreach(println) On Tuesday, 1 March 2016, 20:52, Shixiong(Ryan) Zhu wrote: For Array, you need to all `toSeq` at first. Scala can convert Array to ArrayOps automatically. However, it's not a `Seq` and you need to call `toSeq` explicitly. On Tue, Mar 1, 2016 at 1:02 AM, Ashok Kumar wrote: Thank you sir This works OKimport sqlContext.implicits._ val weights = Seq(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value").orderBy(desc("value")).collect.foreach(println) Please why Array did not work? On Tuesday, 1 March 2016, 8:51, Jeff Zhang wrote: Change Array to Seq and import sqlContext.implicits._ On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar wrote: Hi, I have this val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value") I want to convert the Array to DF but I get thisor weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4), (g,6)) :33: error: value toDF is not a member of Array[(String, Int)] weights.toDF("weights","value") I want to label columns and print out the contents in value order please I don't know why I am getting this error Thanks -- Best Regards Jeff Zhang
Re: Converting array to DF
Thank you sir This works OKimport sqlContext.implicits._ val weights = Seq(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value").orderBy(desc("value")).collect.foreach(println) Please why Array did not work? On Tuesday, 1 March 2016, 8:51, Jeff Zhang wrote: Change Array to Seq and import sqlContext.implicits._ On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar wrote: Hi, I have this val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value") I want to convert the Array to DF but I get thisor weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4), (g,6)) :33: error: value toDF is not a member of Array[(String, Int)] weights.toDF("weights","value") I want to label columns and print out the contents in value order please I don't know why I am getting this error Thanks -- Best Regards Jeff Zhang
Converting array to DF
Hi, I have this:

val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6))
weights.toDF("weights","value")

I want to convert the Array to a DF but I get this:

weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4), (g,6))
:33: error: value toDF is not a member of Array[(String, Int)]
       weights.toDF("weights","value")

I want to label the columns and print out the contents in value order please. I don't know why I am getting this error. Thanks
Re: Recommendation for a good book on Spark, beginner to moderate knowledge
Thank you all for valuable advice. Much appreciated Best On Sunday, 28 February 2016, 21:48, Ashok Kumar wrote: Hi Gurus, Appreciate if you recommend me a good book on Spark or documentation for beginner to moderate knowledge I very much like to skill myself on transformation and action methods. FYI, I have already looked at examples on net. However, some of them not clear at least to me. Warmest regards
Recommendation for a good book on Spark, beginner to moderate knowledge
Hi Gurus, Appreciate if you recommend me a good book on Spark or documentation for beginner to moderate knowledge I very much like to skill myself on transformation and action methods. FYI, I have already looked at examples on net. However, some of them not clear at least to me. Warmest regards
Re: Ordering two dimensional arrays of (String, Int) in the order of second element
no particular reason. just wanted to know if there was another way as well. thanks On Saturday, 27 February 2016, 22:12, Yin Yang wrote: Is there particular reason you cannot use temporary table ? Thanks On Sat, Feb 27, 2016 at 10:59 AM, Ashok Kumar wrote: Thank you sir. Can one do this sorting without using temporary table if possible? Best On Saturday, 27 February 2016, 18:50, Yin Yang wrote: scala> Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a", "b").registerTempTable("test") scala> val df = sql("SELECT struct(id, b, a) from test order by b")df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct] scala> df.show++|struct(id, b, a)|+----+| [2,foo,a]|| [1,test,b]|++ On Sat, Feb 27, 2016 at 10:25 AM, Ashok Kumar wrote: Hello, I like to be able to solve this using arrays. I have two dimensional array of (String,Int) with 5 entries say arr("A",20), arr("B",13), arr("C", 18), arr("D",10), arr("E",19) I like to write a small code to order these in the order of highest Int column so I will have arr("A",20), arr("E",19), arr("C",18) What is the best way of doing this using arrays only? Thanks
Re: Ordering two dimensional arrays of (String, Int) in the order of second element
Thank you sir. Can one do this sorting without using a temporary table, if possible? Best

On Saturday, 27 February 2016, 18:50, Yin Yang wrote:
scala> Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a", "b").registerTempTable("test")

scala> val df = sql("SELECT struct(id, b, a) from test order by b")
df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct]

scala> df.show
+----------------+
|struct(id, b, a)|
+----------------+
|       [2,foo,a]|
|      [1,test,b]|
+----------------+

On Sat, Feb 27, 2016 at 10:25 AM, Ashok Kumar wrote:
Hello, I would like to be able to solve this using arrays. I have a two dimensional array of (String, Int) with 5 entries, say arr("A",20), arr("B",13), arr("C",18), arr("D",10), arr("E",19). I would like to write a small piece of code to order these by the highest Int column, so I will have arr("A",20), arr("E",19), arr("C",18). What is the best way of doing this using arrays only? Thanks
Ordering two dimensional arrays of (String, Int) in the order of second element
Hello, I would like to be able to solve this using arrays. I have a two dimensional array of (String, Int) with 5 entries, say arr("A",20), arr("B",13), arr("C",18), arr("D",10), arr("E",19). I would like to write a small piece of code to order these by the highest Int column, so I will have arr("A",20), arr("E",19), arr("C",18). What is the best way of doing this using arrays only? Thanks
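For the arrays-only route asked about here, a minimal sketch with plain Scala collections (no Spark needed); the element values follow the example above:

val arr = Array(("A", 20), ("B", 13), ("C", 18), ("D", 10), ("E", 19))

// sort by the Int element, highest first, and keep the top 3
val top3 = arr.sortBy(-_._2).take(3)
// top3: Array[(String, Int)] = Array((A,20), (E,19), (C,18))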
Clarification on RDD
Hi, The Spark doco says Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs, for example:

val textFile = sc.textFile("README.md")

My question is: when an RDD is created like above from a file stored on HDFS, does that mean that the data is distributed among all the nodes in the cluster, or is the data from the md file copied to each node of the cluster so that each node has a complete copy of the data? And is the data actually moved around at this point, or is it not read until an action like count() is performed on the RDD? Thanks
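A small sketch of the laziness half of the question, runnable in spark-shell (README.md is whatever file you point it at):

val textFile = sc.textFile("README.md")   // nothing is read yet: this only records how to build the RDD
val upper = textFile.map(_.toUpperCase)    // still nothing read: transformations just extend the lineage

val n = upper.count()                      // action: only now are the file's partitions actually read

As for distribution, each partition of the HDFS file is read by whichever executor processes that partition; the file is not copied wholesale to every node.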
d.filter("id in max(id)")
Hi, How can I make this work?

val d = HiveContext.table("table")

select * from table where ID = MAX(ID)

Thanks
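A minimal DataFrame-side sketch of the same thing on Spark 1.x (the table name is a placeholder; hiveContext is the usual HiveContext instance):

import org.apache.spark.sql.functions.{col, max}

val d = hiveContext.table("mytable")

// compute the maximum ID first, then filter on it
val maxId = d.agg(max(col("ID"))).first().get(0)
val latest = d.filter(col("ID") === maxId)
latest.show()

The SQL equivalent, which also relates to the next question below, is a subquery: select * from mytable where ID in (select max(ID) from mytable). Whether that subquery form is accepted depends on the Spark version (older 1.x Spark SQL releases did not support IN subqueries), so the two-step DataFrame approach is the safer bet on 1.x.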
select * from mytable where column1 in (select max(column1) from mytable)
Hi, What is the equivalent of this in Spark please select * from mytable where column1 in (select max(column1) from mytable) Thanks
Filter on a column having multiple values
Hi, I would like to do the following:

select count(*) from <table> where column1 in (1,5)

I define:

scala> var t = HiveContext.table("table")

This works:

t.filter($"column1" === 1)

How can I expand this to have column1 match both 1 and 5 please? Thanks
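A minimal sketch of the multi-value filter; isin takes a varargs list of values:

// count of rows where column1 is either 1 or 5
val n = t.filter($"column1".isin(1, 5)).count()

// equivalent without isin
val n2 = t.filter($"column1" === 1 || $"column1" === 5).count()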
Re: Execution plan in spark
looks useful thanks On Wednesday, 24 February 2016, 9:42, Yin Yang wrote: Is the following what you were looking for ? sqlContext.sql(""" CREATE TEMPORARY TABLE partitionedParquet USING org.apache.spark.sql.parquet OPTIONS ( path '/tmp/partitioned' )""") table("partitionedParquet").explain(true) On Wed, Feb 24, 2016 at 1:16 AM, Ashok Kumar wrote: Gurus, Is there anything like explain in Spark to see the execution plan in functional programming? warm regards
Execution plan in spark
Gurus, Is there anything like explain in Spark to see the execution plan in functional programming? warm regards
Re: install databricks csv package for spark
Great, thank you.

On Friday, 19 February 2016, 15:33, Holden Karau wrote:
So with --packages, spark-shell and spark-submit will automatically fetch the requirements from Maven. If you want to use an explicit local jar you can do that with the --jars syntax. You might find http://spark.apache.org/docs/latest/submitting-applications.html useful.

On Fri, Feb 19, 2016 at 7:26 AM, Ashok Kumar wrote:
Hi, I downloaded the zipped csv libraries from databricks/spark-csv (https://github.com/databricks/spark-csv - CSV data source for Spark SQL and DataFrames). Now I have a directory created called spark-csv-master. I would like to use this in spark-shell with --packages like below:

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0

Do I need to use mvn to create a zipped file to start, or maybe add it to the Spark CLASSPATH? What is needed here please to make it work? Thanks

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
install databricks csv package for spark
Hi, I downloaded the zipped csv libraries from databricks/spark-csv (https://github.com/databricks/spark-csv - CSV data source for Spark SQL and DataFrames). Now I have a directory created called spark-csv-master. I would like to use this in spark-shell with --packages like below:

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0

Do I need to use mvn to create a zipped file to start, or maybe add it to the Spark CLASSPATH? What is needed here please to make it work? Thanks