Re: Spark Newbie question
Thanks Jerry for the clarification.

Ajay.
Re: Spark Newbie question
Hi Ajay,

When a Spark SQL statement references a table, that table has to be "registered" first. Usually this is done by reading in a DataFrame and then calling createOrReplaceTempView (or one of a few related functions) on that DataFrame, with the argument being the name under which you'd like to register the table. You can then use that name in SQL statements. As far as I know, you cannot refer directly to an external data store without reading it in first.

Jerry

--
http://www.google.com/profiles/grapesmoker
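A minimal spark-shell style sketch of what Jerry describes, for anyone reading along: it assumes the spark SparkSession already exists (as it does in spark-shell), and the CSV path and the view name "events" are placeholders rather than anything from Ajay's cluster.

// read an external source into a DataFrame (path is a placeholder)
val events = spark.read.option("header", "true").csv("/path/to/events.csv")

// register the DataFrame under a name Spark SQL can reference
// for the lifetime of this SparkSession
events.createOrReplaceTempView("events")

// the registered name can now be used in SQL statements
spark.sql("SELECT count(*) FROM events").show()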
Re: Spark Newbie question
Sorry, I guess I hit the send button too soon.

This question is regarding a Spark stand-alone cluster. My understanding is that Spark is an execution engine and not a storage layer. Spark processes data in memory, but when someone refers to a "Spark table" created through Spark SQL (from a DataFrame/RDD), what exactly are they referring to?

Could it be a Hive table? If yes, is it the same Hive store that Spark uses?
Is it a table in memory? If yes, how can an external app access this in-memory table? If via JDBC, which driver?

On a Databricks cluster, could they be referring to a Spark table created through Spark SQL (DataFrame/RDD) as a Hive or Delta Lake table?

Spark version with Hadoop: spark-2.0.2-bin-hadoop2.7

Thanks, and appreciate your help!!
Ajay.
Spark Newbie question
This is a stand-alone Spark cluster. My understanding is that Spark is an execution engine and not a storage layer. Spark processes data in memory, but when someone refers to a "Spark table" created through Spark SQL (from a DataFrame/RDD), what exactly are they referring to?

Could it be a Hive table? If yes, is it the same Hive store that Spark uses?
Is it a table in memory? If yes, how can an external app

Spark version with Hadoop: spark-2.0.2-bin-hadoop2.7

Thanks, and appreciate your help!!
Ajay.
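For anyone following the thread, a hedged spark-shell style sketch of the two things "a Spark table created through Spark SQL" usually means; df stands in for an already-loaded DataFrame, and the table names are placeholders, not anything from this cluster.

// 1. A temporary view: its metadata lives only inside this SparkSession,
//    so external applications cannot see it directly.
df.createOrReplaceTempView("my_temp_view")

// 2. A persistent table: registered through the catalog (a Hive metastore
//    when Spark is built with Hive support), so other sessions and tools
//    that share the same metastore can see it.
df.write.saveAsTable("my_persistent_table")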
Re: A spark newbie question
import org.apache.spark.{SparkConf, SparkContext}

val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-CassandraLogsMessageTypeCount")
val sc = new SparkContext(sconf)
val inputDir = "/path/to/cassandralogs.txt"

sc.textFile(inputDir)
  .map(line => line.replace("\"", ""))
  .map(line => (line.split(' ')(0) + " " + line.split(' ')(2), 1))
  .reduceByKey((v1, v2) => v1 + v2)
  .collect()
  .foreach(println)

If you want to save the output to a file:

val outDir = "/path/to/output/dir/cassandra_logs"
val outFile = outDir + "/" + "sparkout_" + System.currentTimeMillis

sc.textFile(inputDir)
  .map(line => line.replace("\"", ""))
  .map(line => (line.split(' ')(0) + " " + line.split(' ')(2), 1))
  .reduceByKey((v1, v2) => v1 + v2)
  .saveAsTextFile(outFile)

The code is here (not elegant :-) but works):
https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/CassandraLogsMessageTypeCount.scala

OUTPUT
(2014-06-27 PAUSE,1)
(2014-06-27 START,2)
(2014-06-27 STOP,1)
(2014-06-25 STOP,1)
(2014-06-27 RESTART,1)
(2014-06-27 REWIND,2)
(2014-06-25 START,3)
(2014-06-25 PAUSE,1)

Hope this helps. Since you are new to Spark, it may help to learn using an IDE. I use IntelliJ and have many examples posted here:
https://github.com/sanjaysubramanian/msfx_scala.git
These are simple examples from my own learning process :-)

Plus, IMHO, if you are planning on learning Spark, I would say yes to Scala and no to Java. It's a different paradigm, but having been a Java and Hadoop programmer for many years, I am excited to learn Scala as the language and use Spark with it.

regards
sanjay
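Since the original question asks to store the counts back in a new Cassandra table rather than a text file, here is a rough sketch using the spark-cassandra-connector. This is not part of Sanjay's code and is assumption-laden: it presumes the connector is on the classpath, that a table like demo.message_counts (date text, message_type text, count int, PRIMARY KEY (date, message_type)) already exists, and the keyspace, table, host, and input path below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val cassConf = new SparkConf()
  .setMaster("local")
  .setAppName("MessageTypeCountToCassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
val cassSc = new SparkContext(cassConf)

cassSc.textFile("/path/to/cassandralogs.txt")
  .map(_.replace("\"", ""))
  .map { line =>
    val parts = line.split(' ')
    ((parts(0), parts(2)), 1)                            // key = (date, message_type)
  }
  .reduceByKey(_ + _)
  .map { case ((date, messageType), count) => (date, messageType, count) }
  .saveToCassandra("demo", "message_counts",
    SomeColumns("date", "message_type", "count"))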
Re: A spark newbie question
Go through the Spark API documentation. Basically, you have to group by (date, message_type) and then do a count.
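A minimal sketch of that group-by-and-count, written against the DataFrame API that newer Spark releases provide (so not necessarily what was available at the time); the input path and names below are placeholders, and each raw line is assumed to look like: 2014-06-25 12:01:39 "START".

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MessageTypeCounts")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// parse each line into (date, message_type)
val events = spark.sparkContext
  .textFile("/path/to/cassandralogs.txt")
  .map(_.replace("\"", ""))
  .map { line =>
    val parts = line.split(' ')
    (parts(0), parts(2))
  }
  .toDF("date", "message_type")

// group by (date, message_type) and count, as suggested above
val counts = events.groupBy("date", "message_type").count()
counts.show()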
A spark newbie question on summary statistics
A Spark Cassandra newbie question. Appreciate the help.

I have a Cassandra table with 2 columns, message_timestamp (timestamp) and message_type (text). The data is of the form:

2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
2014-06-25 14:02:39 "STOP"
2014-06-25 15:02:39 "START"
2014-06-27 12:01:39 "START"
2014-06-27 11:03:39 "STOP"
2014-06-27 12:03:39 "REWIND"
2014-06-27 12:04:39 "RESTART"
2014-06-27 12:05:39 "PAUSE"
2014-06-27 13:03:39 "REWIND"
2014-06-27 14:03:39 "START"

I want to use Spark (using Java) to calculate counts of a message_type on a per-day basis and store it back in Cassandra in a new table with 3 columns (date, message_type, count). The result table should look like this:

2014-06-25 START 3
2014-06-25 STOP 1
2014-06-25 PAUSE 1
2014-06-27 START 2
2014-06-27 STOP 1
2014-06-27 PAUSE 1
2014-06-27 REWIND 2
2014-06-27 RESTART 1

I'm not proficient in Scala and would like to use Java.
A spark newbie question
A Spark Cassandra newbie question. Thanks in advance for the help.

I have a Cassandra table with 2 columns, message_timestamp (timestamp) and message_type (text). The data is of the form:

2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
2014-06-25 14:02:39 "STOP"
2014-06-25 15:02:39 "START"
2014-06-27 12:01:39 "START"
2014-06-27 11:03:39 "STOP"
2014-06-27 12:03:39 "REWIND"
2014-06-27 12:04:39 "RESTART"
2014-06-27 12:05:39 "PAUSE"
2014-06-27 13:03:39 "REWIND"
2014-06-27 14:03:39 "START"

I want to use Spark (using Java) to calculate counts of a message_type on a per-day basis and store it back in Cassandra in a new table with 3 columns (date, message_type, count). The result table should look like this:

2014-06-25 START 3
2014-06-25 STOP 1
2014-06-25 PAUSE 1
2014-06-27 START 2
2014-06-27 STOP 1
2014-06-27 PAUSE 1
2014-06-27 REWIND 2
2014-06-27 RESTART 1

I'm not proficient in Scala and would like to use Java.