Re: Spark Newbie question

2019-07-11 Thread infa elance
Thanks Jerry for the clarification.

Ajay.


On Thu, Jul 11, 2019 at 12:48 PM Jerry Vinokurov 
wrote:

> Hi Ajay,
>
> When a Spark SQL statement references a table, that table has to be
> "registered" first. Usually this is done by reading in a DataFrame and then
> calling createOrReplaceTempView (or one of a few similar methods) on that
> DataFrame, with the argument being the name under which you'd like to
> register the table. You can then use the table in SQL statements. As far as
> I know, you cannot directly refer to an external data store without reading
> it in first.
>
> Jerry
>
> On Thu, Jul 11, 2019 at 1:27 PM infa elance  wrote:
>
>> Sorry, I guess I hit the send button too soon.
>>
>> This question is regarding a Spark stand-alone cluster. My understanding
>> is that Spark is an execution engine and not a storage layer.
>> Spark processes data in memory, but when someone refers to a Spark table
>> created through Spark SQL (DataFrame/RDD), what exactly are they referring to?
>>
>> Could it be a Hive table? If yes, is it the same Hive metastore that Spark
>> uses?
>> Is it a table in memory? If yes, how can an external app access this
>> in-memory table? If JDBC, what driver?
>>
>> On a Databricks cluster, could they be referring to a Spark table created
>> through Spark SQL (DataFrame/RDD) as a Hive or Delta Lake table?
>>
>> Spark version with Hadoop: spark-2.0.2-bin-hadoop2.7
>>
>> Thanks, and I appreciate your help!!
>> Ajay.
>>
>>
>>
>> On Thu, Jul 11, 2019 at 12:19 PM infa elance 
>> wrote:
>>
>>> This is a stand-alone Spark cluster. My understanding is that Spark is an
>>> execution engine and not a storage layer.
>>> Spark processes data in memory, but when someone refers to a Spark table
>>> created through Spark SQL (DataFrame/RDD), what exactly are they referring to?
>>>
>>> Could it be a Hive table? If yes, is it the same Hive metastore that Spark
>>> uses?
>>> Is it a table in memory? If yes, how can an external app
>>>
>>> Spark version with Hadoop: spark-2.0.2-bin-hadoop2.7
>>>
>>> Thanks, and I appreciate your help!!
>>> Ajay.
>>>
>>
>
> --
> http://www.google.com/profiles/grapesmoker
>


Re: Spark Newbie question

2019-07-11 Thread Jerry Vinokurov
Hi Ajay,

When a Spark SQL statement references a table, that table has to be
"registered" first. Usually this is done by reading in a DataFrame and then
calling createOrReplaceTempView (or one of a few similar methods) on that
DataFrame, with the argument being the name under which you'd like to
register the table. You can then use the table in SQL statements. As far as
I know, you cannot directly refer to an external data store without reading
it in first.
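In code, that registration step looks something like this (a minimal sketch, assuming Spark 2.x with a SparkSession named `spark`; the file path and the view name "people" are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("TempViewExample")
  .getOrCreate()

// Read the external data store into a DataFrame first...
val df = spark.read.json("/path/to/people.json")

// ...then register it under a name that Spark SQL can see.
df.createOrReplaceTempView("people")

// The registered name can now be used in SQL statements.
spark.sql("SELECT name FROM people WHERE age >= 21").show()
```

Note that a temp view lives only for the lifetime of the SparkSession that registered it; it is not a storage-layer table that an external application could query on its own.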

Jerry

On Thu, Jul 11, 2019 at 1:27 PM infa elance  wrote:

> Sorry, I guess I hit the send button too soon.
>
> This question is regarding a Spark stand-alone cluster. My understanding
> is that Spark is an execution engine and not a storage layer.
> Spark processes data in memory, but when someone refers to a Spark table
> created through Spark SQL (DataFrame/RDD), what exactly are they referring to?
>
> Could it be a Hive table? If yes, is it the same Hive metastore that Spark
> uses?
> Is it a table in memory? If yes, how can an external app access this
> in-memory table? If JDBC, what driver?
>
> On a Databricks cluster, could they be referring to a Spark table created
> through Spark SQL (DataFrame/RDD) as a Hive or Delta Lake table?
>
> Spark version with Hadoop: spark-2.0.2-bin-hadoop2.7
>
> Thanks, and I appreciate your help!!
> Ajay.
>
>
>
> On Thu, Jul 11, 2019 at 12:19 PM infa elance 
> wrote:
>
>> This is a stand-alone Spark cluster. My understanding is that Spark is an
>> execution engine and not a storage layer.
>> Spark processes data in memory, but when someone refers to a Spark table
>> created through Spark SQL (DataFrame/RDD), what exactly are they referring to?
>>
>> Could it be a Hive table? If yes, is it the same Hive metastore that Spark
>> uses?
>> Is it a table in memory? If yes, how can an external app
>>
>> Spark version with Hadoop: spark-2.0.2-bin-hadoop2.7
>>
>> Thanks, and I appreciate your help!!
>> Ajay.
>>
>

-- 
http://www.google.com/profiles/grapesmoker


Re: Spark Newbie question

2019-07-11 Thread infa elance
Sorry, I guess I hit the send button too soon.

This question is regarding a Spark stand-alone cluster. My understanding is
that Spark is an execution engine and not a storage layer.
Spark processes data in memory, but when someone refers to a Spark table
created through Spark SQL (DataFrame/RDD), what exactly are they referring to?

Could it be a Hive table? If yes, is it the same Hive metastore that Spark uses?
Is it a table in memory? If yes, how can an external app access this
in-memory table? If JDBC, what driver?

On a Databricks cluster, could they be referring to a Spark table created
through Spark SQL (DataFrame/RDD) as a Hive or Delta Lake table?

Spark version with Hadoop: spark-2.0.2-bin-hadoop2.7

Thanks, and I appreciate your help!!
Ajay.



On Thu, Jul 11, 2019 at 12:19 PM infa elance  wrote:

> This is a stand-alone Spark cluster. My understanding is that Spark is an
> execution engine and not a storage layer.
> Spark processes data in memory, but when someone refers to a Spark table
> created through Spark SQL (DataFrame/RDD), what exactly are they referring to?
>
> Could it be a Hive table? If yes, is it the same Hive metastore that Spark
> uses?
> Is it a table in memory? If yes, how can an external app
>
> Spark version with Hadoop: spark-2.0.2-bin-hadoop2.7
>
> Thanks, and I appreciate your help!!
> Ajay.
>


Spark Newbie question

2019-07-11 Thread infa elance
This is a stand-alone Spark cluster. My understanding is that Spark is an
execution engine and not a storage layer.
Spark processes data in memory, but when someone refers to a Spark table
created through Spark SQL (DataFrame/RDD), what exactly are they referring to?

Could it be a Hive table? If yes, is it the same Hive metastore that Spark uses?
Is it a table in memory? If yes, how can an external app

Spark version with Hadoop: spark-2.0.2-bin-hadoop2.7

Thanks, and I appreciate your help!!
Ajay.


Re: A spark newbie question

2015-01-04 Thread Sanjay Subramanian
import org.apache.spark.{SparkConf, SparkContext}

val sconf = new SparkConf()
  .setMaster("local")
  .setAppName("MedicalSideFx-CassandraLogsMessageTypeCount")
val sc = new SparkContext(sconf)
val inputDir = "/path/to/cassandralogs.txt"

// Strip the quotes, key each line by (date, message_type), count, and print
sc.textFile(inputDir)
  .map(line => line.replace("\"", ""))
  .map(line => (line.split(' ')(0) + " " + line.split(' ')(2), 1))
  .reduceByKey((v1, v2) => v1 + v2)
  .collect()
  .foreach(println)

If you want to save the output to a file:

val outDir = "/path/to/output/dir/cassandra_logs"
val outFile = outDir + "/" + "sparkout_" + System.currentTimeMillis

sc.textFile(inputDir)
  .map(line => line.replace("\"", ""))
  .map(line => (line.split(' ')(0) + " " + line.split(' ')(2), 1))
  .reduceByKey((v1, v2) => v1 + v2)
  .saveAsTextFile(outFile)

The code is here (not elegant :-) but works):
https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/CassandraLogsMessageTypeCount.scala

OUTPUT
===
(2014-06-27 PAUSE,1)
(2014-06-27 START,2)
(2014-06-27 STOP,1)
(2014-06-25 STOP,1)
(2014-06-27 RESTART,1)
(2014-06-27 REWIND,2)
(2014-06-25 START,3)
(2014-06-25 PAUSE,1)
Hope this helps.
Since you are new to Spark, it may help to learn using an IDE. I use IntelliJ
and have many examples posted here:
https://github.com/sanjaysubramanian/msfx_scala.git
These are simple silly examples of my learning process :-)
Plus IMHO, if you are planning on learning Spark, I would say YES to Scala and
NO to Java. Yes, it's a different paradigm, but having been a Java and Hadoop
programmer for many years, I am excited to learn Scala as the language and use
Spark. It's exciting.
regards
sanjay
From: Aniket Bhatnagar
To: Dinesh Vallabhdas; user@spark.apache.org
Sent: Sunday, January 4, 2015 11:07 AM
Subject: Re: A spark newbie question
Go through the Spark API documentation. Basically you have to do a group by on
(date, message_type) and then do a count.


On Sun, Jan 4, 2015, 9:58 PM Dinesh Vallabhdas  
wrote:

A spark cassandra newbie question. Thanks in advance for the help.
I have a cassandra table with 2 columns message_timestamp(timestamp) and
message_type(text). The data is of the form

2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
2014-06-25 14:02:39 "STOP"
2014-06-25 15:02:39 "START"
2014-06-27 12:01:39 "START"
2014-06-27 11:03:39 "STOP"
2014-06-27 12:03:39 "REWIND"
2014-06-27 12:04:39 "RESTART"
2014-06-27 12:05:39 "PAUSE"
2014-06-27 13:03:39 "REWIND"
2014-06-27 14:03:39 "START"
I want to use spark (using java) to calculate counts of a message_type on a
per day basis and store it back in cassandra in a new table with 3 columns
(date, message_type, count). The result table should look like this

2014-06-25 START 3
2014-06-25 STOP 1
2014-06-25 PAUSE 1
2014-06-27 START 2
2014-06-27 STOP 1
2014-06-27 PAUSE 1
2014-06-27 REWIND 2
2014-06-27 RESTART 1
I'm not proficient in scala and would like to use java.




  

Re: A spark newbie question

2015-01-04 Thread Aniket Bhatnagar
Go through the Spark API documentation. Basically you have to do a group by on
(date, message_type) and then do a count.
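As a sketch of that approach (assuming the DataFrame API and that the Cassandra
table has already been read into a DataFrame named `df`, e.g. via the
spark-cassandra-connector; the column names are taken from the question below):

```scala
import org.apache.spark.sql.functions.{col, to_date}

// df has columns message_timestamp (timestamp) and message_type (text)
val counts = df
  .withColumn("date", to_date(col("message_timestamp")))
  .groupBy("date", "message_type")
  .count()

counts.show()
// counts can then be written back to Cassandra as the
// (date, message_type, count) table.
```

The same calls exist in the Java API (Dataset&lt;Row&gt; with groupBy(...).count()),
so this pattern does not require Scala.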

On Sun, Jan 4, 2015, 9:58 PM Dinesh Vallabhdas 
wrote:

> A spark cassandra newbie question. Thanks in advance for the help.
> I have a cassandra table with 2 columns message_timestamp(timestamp) and
> message_type(text). The data is of the form
>
> 2014-06-25 12:01:39 "START"
> 2014-06-25 12:02:39 "START"
> 2014-06-25 12:02:39 "PAUSE"
> 2014-06-25 14:02:39 "STOP"
> 2014-06-25 15:02:39 "START"
> 2014-06-27 12:01:39 "START"
> 2014-06-27 11:03:39 "STOP"
> 2014-06-27 12:03:39 "REWIND"
> 2014-06-27 12:04:39 "RESTART"
> 2014-06-27 12:05:39 "PAUSE"
> 2014-06-27 13:03:39 "REWIND"
> 2014-06-27 14:03:39 "START"
>
> I want to use spark(using java) to calculate counts of a message_type on a
> per day basis and store it back in cassandra in a new table with 3 columns (
> date,message_type,count).
> The result table should look like this
>
> 2014-06-25 START 3
> 2014-06-25 STOP 1
> 2014-06-25 PAUSE 1
> 2014-06-27 START 2
> 2014-06-27 STOP 1
> 2014-06-27 PAUSE 1
> 2014-06-27 REWIND 2
> 2014-06-27 RESTART 1
>
> I'm not proficient in scala and would like to use java.
>
>
>


A spark newbie question on summary statistics

2015-01-04 Thread anondin
A spark cassandra newbie question. Appreciate the help.

I have a cassandra table with 2 columns message_timestamp(timestamp) and
message_type(text). The data is of the form

2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
2014-06-25 14:02:39 "STOP"
2014-06-25 15:02:39 "START"
2014-06-27 12:01:39 "START"
2014-06-27 11:03:39 "STOP"
2014-06-27 12:03:39 "REWIND"
2014-06-27 12:04:39 "RESTART"
2014-06-27 12:05:39 "PAUSE"
2014-06-27 13:03:39 "REWIND"
2014-06-27 14:03:39 "START"
I want to use spark (using java) to calculate counts of a message_type on a
per day basis and store it back in cassandra in a new table with 3 columns
(date, message_type, count).

The result table should look like this

2014-06-25 START 3
2014-06-25 STOP 1
2014-06-25 PAUSE 1
2014-06-27 START 2
2014-06-27 STOP 1
2014-06-27 PAUSE 1
2014-06-27 REWIND 2
2014-06-27 RESTART 1
I'm not proficient in scala and would like to use java.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/A-spark-newbie-question-on-summary-statistics-tp20962.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




A spark newbie question

2015-01-04 Thread Dinesh Vallabhdas
A spark cassandra newbie question. Thanks in advance for the help.
I have a cassandra table with 2 columns message_timestamp(timestamp) and
message_type(text). The data is of the form

2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
2014-06-25 14:02:39 "STOP"
2014-06-25 15:02:39 "START"
2014-06-27 12:01:39 "START"
2014-06-27 11:03:39 "STOP"
2014-06-27 12:03:39 "REWIND"
2014-06-27 12:04:39 "RESTART"
2014-06-27 12:05:39 "PAUSE"
2014-06-27 13:03:39 "REWIND"
2014-06-27 14:03:39 "START"
I want to use spark (using java) to calculate counts of a message_type on a
per day basis and store it back in cassandra in a new table with 3 columns
(date, message_type, count). The result table should look like this

2014-06-25 START 3
2014-06-25 STOP 1
2014-06-25 PAUSE 1
2014-06-27 START 2
2014-06-27 STOP 1
2014-06-27 PAUSE 1
2014-06-27 REWIND 2
2014-06-27 RESTART 1
I'm not proficient in scala and would like to use java.