Testing spark e-mail list

2017-11-09 Thread David Hodeffi




RE: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread David Hodeffi
Testing Spark group e-mail






spark sql truncate function

2017-10-30 Thread David Hodeffi
I saw that the date truncation function supports MM or YY, but it is not possible to truncate by WEEK, HOUR, or MINUTE.
Am I right? Is there any objection to supporting it, or is it just not implemented yet?
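
For reference, a workaround that seems possible with functions that already exist, until finer-grained truncation is supported (a sketch only; spark is assumed to be an existing SparkSession, and events / ts are hypothetical table and timestamp column names):

import org.apache.spark.sql.functions._

val df = spark.table("events")
val truncated = df.select(
  trunc(col("ts"), "MM").as("month_start"),                   // supported today
  expr("from_unixtime(unix_timestamp(ts) - unix_timestamp(ts) % 3600)")
    .as("hour_start"),                                        // manual truncation to the hour
  expr("next_day(date_sub(ts, 7), 'Mon')").as("week_start")   // manual truncation to the start of the (Monday-based) week
)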


Thanks David



Spark Storage Tab is empty

2016-12-26 Thread David Hodeffi
I have tried the following code but didn't see anything on the storage tab.


val myrdd = sc.parallelize(1 to 100)   // fixed typo: was "parallelilize"
myrdd.setName("my_rdd")                // name that should appear in the Storage tab
myrdd.cache()                          // mark for caching (default MEMORY_ONLY)
myrdd.collect()                        // action that materializes the RDD and should populate the cache

The Storage tab is empty, though I can see the stage for collect().
I am using Spark 1.6.2, HDP 2.5, Spark on YARN.
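
One way to cross-check from the driver side, independently of the UI, is to ask the SparkContext what it believes is cached. A small sketch against the same sc as above:

sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id '${rdd.name}' -> ${rdd.getStorageLevel.description}")
}

If my_rdd does not appear here either, the caching itself is not taking effect, rather than the UI failing to show it.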


Thanks David




Re: SPARK -SQL Understanding BroadcastNestedLoopJoin and number of partitions

2016-12-21 Thread David Hodeffi
Do you know who I can talk to about this code? I am really curious to know why it is implemented as a union, and why the number of partitions for the join is the sum of both of them; I expected the number of partitions to be the same as the streamed table, or in the worst case multiplied.

Sent from my iPhone

On Dec 21, 2016, at 14:43, David Hodeffi <david.hode...@niceactimize.com> wrote:


I have two DataFrames which I am joining, one small and one big. The optimizer suggests using BroadcastNestedLoopJoin.
The number of partitions for the big DataFrame is 200, while the small DataFrame has 5 partitions.
The joined DataFrame ends up with 205 partitions (joined.rdd.partitions.size). I tried to understand where this number comes from and figured out that BroadcastNestedLoopJoin is effectively a union.
Code (paraphrased):

case class BroadcastNestedLoopJoin {
  def doExecute(): RDD[InternalRow] = {
    // ...
    sparkContext.union(
      matchedStreamRows,
      sparkContext.makeRDD(notMatchedBroadcastRows))
  }
}

Can someone please explain what exactly the code of doExecute() does? Can you elaborate on all the null checks and why we can have nulls? Why do we have 205 partitions? A link to a JIRA discussion that explains the code would help.
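
On the partition count specifically, the 205 seems to follow directly from how RDD union behaves: the result keeps the partitions of every input. A minimal sketch that reproduces the arithmetic (partition counts chosen to match the question, against an existing SparkContext sc):

val big     = sc.parallelize(1 to 1000, numSlices = 200)  // streamed side: 200 partitions
val small   = sc.parallelize(1 to 10, numSlices = 5)      // broadcast side after makeRDD: 5 partitions
val unioned = sc.union(big, small)
println(unioned.partitions.length)                         // 205 = 200 + 5

So the sum is a property of the union step, not of the join condition itself.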









RE: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread David Hodeffi
I am not aware of any problem with that.
In any case, if you run a Spark application you will normally have multiple jobs within it, so this should not be a problem.
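
To illustrate: within a single application, each action already submits its own job, so several jobs under one application is the normal case. A tiny sketch against an existing SparkContext sc:

val rdd = sc.parallelize(1 to 100)
rdd.count()    // first action -> one job in the UI
rdd.collect()  // second action -> another job, same application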

Thanks David.

From: Naveen [mailto:hadoopst...@gmail.com]
Sent: Wednesday, December 21, 2016 9:18 AM
To: d...@spark.apache.org; user@spark.apache.org
Subject: Launching multiple spark jobs within a main spark job.

Hi Team,

Is it OK to spawn multiple Spark jobs from within a main Spark job? My main Spark job's driver, which was launched on the YARN cluster, will do some preprocessing and, based on it, needs to launch multiple Spark jobs on the YARN cluster. I am not sure if this is the right pattern.

Please share your thoughts.
Sample code I have is below, for better understanding:
-

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.launcher.SparkLauncher
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object MainSparkJob {

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MainSparkJob"))
    implicit val ec: ExecutionContext = ExecutionContext.global

    // Preprocessing: fetch from Hive (via HiveContext) and from HBase here.

    // Spawn each child application as a separate process via SparkLauncher.
    // The jar path and main class below are placeholders; they were elided in the original sketch.
    val future1 = Future {
      val child = new SparkLauncher()
        .setAppResource("/path/to/child-job.jar")
        .setMainClass("com.example.ChildJob")
        .setMaster("yarn")
        .launch()
      child.waitFor()
    }
    // Similarly, future2 to futureN.

    future1.onComplete {
      case Success(exitCode) => println(s"Child job finished with exit code $exitCode")
      case Failure(e)        => println(s"Child job failed: ${e.getMessage}")
    }
  }
} // end of MainSparkJob
--
