Re: MLlib, Java, and DataFrame

2016-07-21 Thread VG
Interesting. Thanks for this information.

On Fri, Jul 22, 2016 at 11:26 AM, Bryan Cutler  wrote:

> ML has a DataFrame-based API, while MLlib is RDD-based and will be deprecated
> as of Spark 2.0.
>
> On Thu, Jul 21, 2016 at 10:41 PM, VG  wrote:
>
>> Why do we have these 2 packages ... ml and mllib?
>> What is the difference between them?
>>
>>
>>
>> On Fri, Jul 22, 2016 at 11:09 AM, Bryan Cutler  wrote:
>>
>>> Hi JG,
>>>
>>> If you didn't know this, Spark MLlib has 2 APIs, one of which uses
>>> DataFrames.  Take a look at this example
>>> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
>>>
>>> This example uses a Dataset, which is type equivalent to a
>>> DataFrame.
>>>
>>>
>>> On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin 
>>> wrote:
>>>
 Hi,

 I am looking for some really super basic examples of MLlib (like a
 linear regression over a list of values) in Java. I have found a few, but I
 only saw them using JavaRDD... and not DataFrame.

 I was kind of hoping to take my current DataFrame and send them in
 MLlib. Am I too optimistic? Do you know/have any example like that?

 Thanks!

 jg


 Jean Georges Perrin
 j...@jgp.net / @jgperrin





>>>
>>
>


Re: Programmatic use of UDFs from Java

2016-07-21 Thread Bryan Cutler
Everett, I had the same question today and came across this old thread.
Not sure if there has been any more recent work to support this.
http://apache-spark-developers-list.1001551.n3.nabble.com/Using-UDFs-in-Java-without-registration-td12497.html
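
In the meantime, one blunt way around the name-conflict part (just a sketch,
not an official API) is to generate the registered name and keep it in a local
variable, reusing testUdf1, df and sqlContext from the example below:

import java.util.UUID;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

// Register under a generated name so callers never have to coordinate names.
String generatedName = "udf_" + UUID.randomUUID().toString().replace("-", "");
sqlContext.udf().register(generatedName, testUdf1, DataTypes.LongType);

// The name only lives in this variable, so conflicts are effectively avoided.
Column newCol = functions.callUDF(generatedName, df.col("old_col"));
DataFrame df2 = df.withColumn("new_col", newCol);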


On Thu, Jul 21, 2016 at 10:10 AM, Everett Anderson  wrote:

> Hi,
>
> In the Java Spark DataFrames API, you can create a UDF, register it, and
> then access it by string name by using the convenience UDF classes in
> org.apache.spark.sql.api.java
> 
> .
>
> Example
>
> UDF1 testUdf1 = new UDF1<>() { ... }
>
> sqlContext.udf().register("testfn", testUdf1, DataTypes.LongType);
>
> DataFrame df2 = df.withColumn("new_col", functions.callUDF("testfn",
> df.col("old_col")));
>
> However, I'd like to avoid registering these by name, if possible, since I
> have many of them and would need to deal with name conflicts.
>
> There are udf() methods like this that seem to be from the Scala API,
> where you don't have to register everything by name first.
>
> However, using those methods from Java would require interacting with
> Scala's scala.reflect.api.TypeTags.TypeTag. I'm having a hard time
> figuring out how to create a TypeTag from Java.
>
> Does anyone have an example of using the udf() methods from Java?
>
> Thanks!
>
> - Everett
>
>


Re: Reading multiple json files form nested folders for data frame

2016-07-21 Thread Ashutosh Kumar
Thanks for the response. I am using Google Cloud. I have a couple of options:
1. I can go with Spark and run SQL queries using SQLContext.
2. Use Hive.
As I understand it, Hive will use Spark as its underlying engine. Is that correct?
Also, my data is JSON and is highly nested.
What do you suggest?

Thanks
Ashutosh
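
P.S. For reference, the wildcard approach Simone describes further down the
thread looks roughly like this in Java (a sketch assuming a Spark 1.6
SQLContext; the bucket and date-wise path layout are made up):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

JavaSparkContext jsc =
    new JavaSparkContext(new SparkConf().setAppName("read-nested-json"));
SQLContext sqlContext = new SQLContext(jsc);

// The glob is expanded by the underlying Hadoop FileSystem, so date-wise
// folders such as .../2016/07/21/*.json can all be matched at once.
DataFrame events = sqlContext.read().json("gs://my-bucket/events/*/*/*/*.json");
events.printSchema();          // union of the schemas found in the matched files

events.registerTempTable("events");
sqlContext.sql("SELECT count(*) AS n FROM events").show();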

On Fri, Jul 22, 2016 at 7:35 AM, Gourav Sengupta 
wrote:

> If you are using EMR, please try their latest release, there will be very
> few reasons left for using SPARK ever at all (particularly given that
> hiveContext rides a lot on HIVE) if you are using SQL.
>
> Just over regular csv data I have seen Hive on TEZ performance gains by
> 100x (query 64 million rows x 570 columns in 2.5 mins) , and when using ORC
>  the performance gains are super fast (query 64 million rows x 570 columns
> in 54 seconds) and with proper partitioning and indexing in ORC its blazing
> fast (query 64 million rows x 570 columns in 19 seconds). There is perhaps
> a reason why SPARK makes things slow while using ORC :)
>
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 12:40 PM, Ashutosh Kumar  > wrote:
>
>> It works. Is it better to have hive in this case for better performance ?
>>
>> On Thu, Jul 21, 2016 at 12:30 PM, Simone 
>> wrote:
>>
>>> If you have a folder, and a bunch of json inside that folder- yes it
>>> should work. Just set as path something like "path/to/your/folder/*.json"
>>> All files will be loaded into a dataframe and schema will be the union
>>> of all the different schemas of your json files (only if you have different
>>> schemas)
>>> It should work - let me know
>>>
>>> Simone Miraglia
>>> --
>>> Da: Ashutosh Kumar 
>>> Inviato: ‎21/‎07/‎2016 08:55
>>> A: Simone ; user @spark
>>> 
>>> Oggetto: Re: Reading multiple json files form nested folders for data
>>> frame
>>>
>>> That example points to a particular json file. Will it work same way if
>>> I point to top level folder containing all json files ?
>>>
>>> On Thu, Jul 21, 2016 at 12:04 PM, Simone 
>>> wrote:
>>>
 Yes you can - have a look here
 http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

 Hope it helps

 Simone Miraglia
 --
 Da: Ashutosh Kumar 
 Inviato: ‎21/‎07/‎2016 08:19
 A: user @spark 
 Oggetto: Reading multiple json files form nested folders for data frame

 I need to read bunch of json files kept in date wise folders and
 perform sql queries on them using data frame. Is it possible to do so?
 Please provide some pointers .

 Thanks
 Ashutosh

>>>
>>>
>>
>


Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
ML has a DataFrame-based API, while MLlib is RDD-based and will be deprecated as
of Spark 2.0.
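
For JG's original question further down the thread (taking an existing
DataFrame into MLlib), the usual bridge in the DataFrame-based API is
VectorAssembler. A rough Java sketch, with invented column names:

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.DataFrame;

// Pack the numeric feature columns of an existing DataFrame into the single
// "features" vector column that the spark.ml estimators expect.
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] {"x1", "x2"})   // invented column names
    .setOutputCol("features");

DataFrame training = assembler.transform(myExistingDf)
    .withColumnRenamed("y", "label");          // spark.ml also expects "label"

LinearRegressionModel model = new LinearRegression().fit(training);
System.out.println("coefficients=" + model.coefficients());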

On Thu, Jul 21, 2016 at 10:41 PM, VG  wrote:

> Why do we have these 2 packages ... ml and mllib?
> What is the difference between them?
>
>
>
> On Fri, Jul 22, 2016 at 11:09 AM, Bryan Cutler  wrote:
>
>> Hi JG,
>>
>> If you didn't know this, Spark MLlib has 2 APIs, one of which uses
>> DataFrames.  Take a look at this example
>> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
>>
>> This example uses a Dataset, which is type equivalent to a DataFrame.
>>
>>
>> On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin  wrote:
>>
>>> Hi,
>>>
>>> I am looking for some really super basic examples of MLlib (like a
>>> linear regression over a list of values) in Java. I have found a few, but I
>>> only saw them using JavaRDD... and not DataFrame.
>>>
>>> I was kind of hoping to take my current DataFrame and send them in
>>> MLlib. Am I too optimistic? Do you know/have any example like that?
>>>
>>> Thanks!
>>>
>>> jg
>>>
>>>
>>> Jean Georges Perrin
>>> j...@jgp.net / @jgperrin
>>>
>>>
>>>
>>>
>>>
>>
>


Re: SparkWebUI and Master URL on EC2

2016-07-21 Thread Ismaël Mejía
Hello,

If you are using EMR you probably need to create a SSH tunnel so you can
access the web ports of the master instance.

https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html

Also verify whether your EMR cluster is behind a private VPC; in that case
you must also verify that you can correctly access the VPC and create the
tunnel or port forwards to do so.

Hope it helps,
Ismaël Mejía


On Thu, Jul 21, 2016 at 10:33 PM, Jacek Laskowski  wrote:

> Hi,
>
> What's in the logs of spark-shell? There should be the host and port
> of web UI. what the public IP of the host where you execute
> spark-shell? Use it for 4040. I don't think you use Spark Standalone
> cluster (the other address with 8080) if you simply spark-shell
> (unless you've got spark.master in conf/spark-defaults.conf).
>
> When you're inside spark-shell, can you sc.master? Use this for Spark
> Standalone's web UI (replacing 7077 to 8080).
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Wed, Jul 20, 2016 at 10:09 PM, KhajaAsmath Mohammed
>  wrote:
> > Hi,
> >
> > I got access to a Spark cluster and have instantiated spark-shell on AWS
> > using the command
> > $spark-shell.
> >
> > The Spark shell started successfully, but I am looking to access the WebUI and
> > Master URL. Does anyone know how to access those on AWS?
> >
> > I tried http://IPMaster:4040 and http://IpMaster:8080 but they didn't show
> > anything.
> >
> > Thanks,
> > Asmath.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: MLlib, Java, and DataFrame

2016-07-21 Thread VG
Why do we have these 2 packages ... ml and mllib?
What is the difference between them?



On Fri, Jul 22, 2016 at 11:09 AM, Bryan Cutler  wrote:

> Hi JG,
>
> If you didn't know this, Spark MLlib has 2 APIs, one of which uses
> DataFrames.  Take a look at this example
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
>
> This example uses a Dataset, which is type equivalent to a DataFrame.
>
>
> On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin  wrote:
>
>> Hi,
>>
>> I am looking for some really super basic examples of MLlib (like a linear
>> regression over a list of values) in Java. I have found a few, but I only
>> saw them using JavaRDD... and not DataFrame.
>>
>> I was kind of hoping to take my current DataFrame and send them in MLlib.
>> Am I too optimistic? Do you know/have any example like that?
>>
>> Thanks!
>>
>> jg
>>
>>
>> Jean Georges Perrin
>> j...@jgp.net / @jgperrin
>>
>>
>>
>>
>>
>


Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
Hi JG,

If you didn't know this, Spark MLlib has 2 APIs, one of which uses
DataFrames.  Take a look at this example
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java

This example uses a Dataset, which is type equivalent to a DataFrame.
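
For a really minimal end-to-end version in Java -- a linear regression over a
handful of values, driven through a DataFrame -- something like this should be
close (a sketch assuming Spark 1.6 with SQLContext; class and app names are
made up):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class TinyLinearRegression {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("tiny-lr").setMaster("local[*]"));
    SQLContext sqlContext = new SQLContext(jsc);

    // A handful of (label, features) points turned into a DataFrame with the
    // "label"/"features" columns that the spark.ml estimators expect.
    List<LabeledPoint> points = Arrays.asList(
        new LabeledPoint(1.0, Vectors.dense(1.0)),
        new LabeledPoint(2.1, Vectors.dense(2.0)),
        new LabeledPoint(2.9, Vectors.dense(3.0)));
    DataFrame training =
        sqlContext.createDataFrame(jsc.parallelize(points), LabeledPoint.class);

    LinearRegressionModel model = new LinearRegression().setMaxIter(10).fit(training);
    System.out.println("coefficients=" + model.coefficients()
        + " intercept=" + model.intercept());
    jsc.stop();
  }
}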


On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin  wrote:

> Hi,
>
> I am looking for some really super basic examples of MLlib (like a linear
> regression over a list of values) in Java. I have found a few, but I only
> saw them using JavaRDD... and not DataFrame.
>
> I was kind of hoping to take my current DataFrame and send them in MLlib.
> Am I too optimistic? Do you know/have any example like that?
>
> Thanks!
>
> jg
>
>
> Jean Georges Perrin
> j...@jgp.net / @jgperrin
>
>
>
>
>


Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ted Yu
You can use this command (assuming log aggregation is turned on):

yarn logs --applicationId XX

In the log, you should see snippet such as the following:

java.class.path=...

FYI

On Thu, Jul 21, 2016 at 9:38 PM, Ilya Ganelin  wrote:

> what's the easiest way to get the Classpath for the spark application
> itself?
>
> On Thu, Jul 21, 2016 at 9:37 PM Ted Yu  wrote:
>
>> Might be classpath issue.
>>
>> Mind pastebin'ning the effective class path ?
>>
>> Stack trace of NoClassDefFoundError may also help provide some clue.
>>
>> On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin  wrote:
>>
>>> Hello - I'm trying to deploy the Spark TimeSeries library in a new
>>> environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
>>> with Java 8 installed on all nodes but I'm getting the NoClassDef at
>>> runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is
>>> part of Java 8 I feel like I shouldn't need to do anything else. The weird
>>> thing is I get it on the data nodes, not the driver. Any thoughts on what's
>>> causing this or how to track it down? Would appreciate the help.
>>>
>>> Thanks!
>>>
>>
>>


Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ilya Ganelin
what's the easiest way to get the Classpath for the spark application
itself?
On Thu, Jul 21, 2016 at 9:37 PM Ted Yu  wrote:

> Might be classpath issue.
>
> Mind pastebin'ning the effective class path ?
>
> Stack trace of NoClassDefFoundError may also help provide some clue.
>
> On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin  wrote:
>
>> Hello - I'm trying to deploy the Spark TimeSeries library in a new
>> environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
>> with Java 8 installed on all nodes but I'm getting the NoClassDef at
>> runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is
>> part of Java 8 I feel like I shouldn't need to do anything else. The weird
>> thing is I get it on the data nodes, not the driver. Any thoughts on what's
>> causing this or how to track it down? Would appreciate the help.
>>
>> Thanks!
>>
>
>


Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ted Yu
Might be classpath issue.

Mind pastebin'ning the effective class path ?

Stack trace of NoClassDefFoundError may also help provide some clue.

On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin  wrote:

> Hello - I'm trying to deploy the Spark TimeSeries library in a new
> environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
> with Java 8 installed on all nodes but I'm getting the NoClassDef at
> runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is
> part of Java 8 I feel like I shouldn't need to do anything else. The weird
> thing is I get it on the data nodes, not the driver. Any thoughts on what's
> causing this or how to track it down? Would appreciate the help.
>
> Thanks!
>


Applying schema on single column dataframe in java

2016-07-21 Thread raheel-akl
Hi folks, 

I am reading lines from apache webserver log file into spark data frame. A
sample line from log file is below:

piweba4y.prodigy.com - - [01/Aug/1995:00:00:10 -0400] "GET
/images/launchmedium.gif HTTP/1.0" 200 11853

I have split the values into host, timestamp, path, status and
content_size, and apply this as the schema of a new dataframe.

host: piweba4y.prodigy.com
timestamp: 01/Aug/1995:00:00:10 -0400
path: /images/launchmedium.gif
status: 200
content_size: 11853

I have done all of the above in Python through regular expressions and then
applied the schema (the 5 columns above) as well, and now I would like to do the
same in Java, but I have no clue how to do it. I am able to split the
values using the regex library in Java. The next step is to create columns
(currently each line is a single column named 'value') in my DF. Can someone help
with how to do this in Java? It is easy to do in Python, but Java seems to
be a little tougher.

HELP!
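
For what it's worth, a rough sketch of the usual programmatic-schema route in
Java (Spark 1.6), assuming the raw lines are already in a one-column DataFrame
called rawLines with a 'value' column as described, and using a simplified
regex that you would swap for your own:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Simplified common-log regex -- replace with the one you already have.
final Pattern LOG_LINE = Pattern.compile(
    "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"\\S+ (\\S+)[^\"]*\" (\\d+) (\\d+)");

// The five columns described above.
StructType schema = new StructType(new StructField[] {
    DataTypes.createStructField("host", DataTypes.StringType, false),
    DataTypes.createStructField("timestamp", DataTypes.StringType, false),
    DataTypes.createStructField("path", DataTypes.StringType, false),
    DataTypes.createStructField("status", DataTypes.IntegerType, false),
    DataTypes.createStructField("content_size", DataTypes.LongType, false)});

// Parse each raw line into a Row matching the schema above.
JavaRDD<Row> parsed = rawLines.javaRDD().map(row -> {
  Matcher m = LOG_LINE.matcher(row.getString(0));
  m.find();
  return RowFactory.create(m.group(1), m.group(2), m.group(3),
      Integer.valueOf(m.group(4)), Long.valueOf(m.group(5)));
});

DataFrame logs = sqlContext.createDataFrame(parsed, schema);
logs.printSchema();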





-




Raheel - (aspiring DS)
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Applying-schema-on-single-column-dataframe-in-java-tp27393.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



MLlib, Java, and DataFrame

2016-07-21 Thread Jean Georges Perrin
Hi,

I am looking for some really super basic examples of MLlib (like a linear 
regression over a list of values) in Java. I have found a few, but I only saw 
them using JavaRDD... and not DataFrame.

I was kind of hoping to take my current DataFrame and send them in MLlib. Am I 
too optimistic? Do you know/have any example like that?

Thanks!

jg


Jean Georges Perrin
j...@jgp.net  / @jgperrin






NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ilya Ganelin
Hello - I'm trying to deploy the Spark TimeSeries library in a new
environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
with Java 8 installed on all nodes but I'm getting the NoClassDef at
runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is
part of Java 8 I feel like I shouldn't need to do anything else. The weird
thing is I get it on the data nodes, not the driver. Any thoughts on what's
causing this or how to track it down? Would appreciate the help.

Thanks!


Re: How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Andy Davidson
Thanks

Andy

From:  Saisai Shao 
Date:  Thursday, July 21, 2016 at 6:11 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: How to submit app in cluster mode? port 7077 or 6066

> I think both 6066 and 7077 will work. 6066 uses the REST way to submit the
> application, while 7077 is the legacy way. From the user's perspective, it should be
> transparent, with no need to worry about the difference.
> 
> * URL: spark://hw12100.local:7077
> * REST URL: spark://hw12100.local:6066 (cluster mode)
> 
> Thanks
> Saisai
> 
> On Fri, Jul 22, 2016 at 6:44 AM, Andy Davidson 
> wrote:
>> I have some very long lived streaming apps. They have been running for
>> several months. I wonder if something has changed recently? I first started
>> working with spark-1.3 . I am using the stand alone cluster manager. The way
>> I would submit my app to run in cluster mode was port 6066
>> 
>> 
>> Looking at the spark-1.6 it seems like we are supposed to use port 7077 and
>> the  new argument
>> 
>> http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
>> * --deploy-mode: Whether to deploy your driver on the worker nodes (cluster)
>> or locally as an external client (client) (default: client)
>> 
>> Can anyone confirm this. It took me a very long time to figure out how to get
>> things to run cluster mode.
>> 
>> Thanks
>> 
>> Andy
> 




Re: the spark job is so slow - almost frozen

2016-07-21 Thread Gourav Sengupta
Andrew,

you have pretty much consolidated my entire experience, please give a
presentation in a meetup on this, and send across the links :)


Regards,
Gourav
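
To make the first few points in Andrew's list below concrete, here is a small
Java sketch (table and column names are invented for illustration):

// Filter early, keep only the needed columns, and cache a small lookup table
// that is hit repeatedly.
DataFrame facts = hiveContext.table("fact_table")
    .filter("event_date >= '2016-07-01'")   // drop rows as soon as possible
    .select("customer_id", "amount");       // drop columns you don't need

DataFrame dims = hiveContext.table("dim_customer")
    .select("customer_id", "segment")
    .cache();                               // frequently accessed dimension table

DataFrame joined = facts.join(dims, "customer_id");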

On Wed, Jul 20, 2016 at 4:35 AM, Andrew Ehrlich  wrote:

> Try:
>
> - filtering down the data as soon as possible in the job, dropping columns
> you don’t need.
> - processing fewer partitions of the hive tables at a time
> - caching frequently accessed data, for example dimension tables, lookup
> tables, or other datasets that are repeatedly accessed
> - using the Spark UI to identify the bottlenecked resource
> - remove features or columns from the output data, until it runs, then add
> them back in one at a time.
> - creating a static dataset small enough to work, and editing the query,
> then retesting, repeatedly until you cut the execution time by a
> significant fraction
> - Using the Spark UI or spark shell to check the skew and make sure
> partitions are evenly distributed
>
> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu  > wrote:
>
> Thanks a lot for your reply .
>
> In effect, here we tried to run the SQL on Kettle, Hive and Spark Hive
> (via HiveContext) respectively; the job seems frozen and never finishes.
>
> From the 6 tables we need to read different columns in
> each table for specific information, then do some simple calculation
> before output.
> The join operation is used most in the SQL.
>
> Best wishes!
>
>
>
>
> On Monday, July 18, 2016 6:24 PM, Chanh Le  wrote:
>
>
> Hi,
> What about the network (bandwidth) between hive and spark?
> Does it run in Hive before then you move to Spark?
> Because It's complex you can use something like EXPLAIN command to show
> what going on.
>
>
>
>
>
>
> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu  > wrote:
>
> the sql logic in the program is very much complex , so do not describe the
> detailed codes   here .
>
>
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <
> zchl.j...@yahoo.com.INVALID > wrote:
>
>
> Hi All,
>
> Here we have one application; it needs to extract different columns from 6
> Hive tables and then do some easy calculation. There are around 100,000
> rows in each table,
> and finally we need to output another table or file (with a consistent set of
> columns).
>
> However, after many days of trying, the Spark Hive job is unthinkably
> slow - sometimes almost frozen. There are 5 nodes in the Spark cluster.
>
> Could anyone offer some help, some idea or clue is also good.
>
> Thanks in advance~
>
> Zhiliang
>
>
>
>
>
>
>


Re: Reading multiple json files form nested folders for data frame

2016-07-21 Thread Gourav Sengupta
If you are using EMR, please try their latest release, there will be very
few reasons left for using SPARK ever at all (particularly given that
hiveContext rides a lot on HIVE) if you are using SQL.

Just over regular CSV data I have seen Hive on Tez performance gains of
100x (querying 64 million rows x 570 columns in 2.5 mins), and when using ORC
the gains are even better (64 million rows x 570 columns in 54 seconds), and
with proper partitioning and indexing in ORC it is blazing fast (64 million
rows x 570 columns in 19 seconds). There is perhaps a reason why Spark makes
things slow while using ORC :)


Regards,
Gourav

On Thu, Jul 21, 2016 at 12:40 PM, Ashutosh Kumar 
wrote:

> It works. Is it better to have hive in this case for better performance ?
>
> On Thu, Jul 21, 2016 at 12:30 PM, Simone 
> wrote:
>
>> If you have a folder, and a bunch of json inside that folder- yes it
>> should work. Just set as path something like "path/to/your/folder/*.json"
>> All files will be loaded into a dataframe and schema will be the union of
>> all the different schemas of your json files (only if you have different
>> schemas)
>> It should work - let me know
>>
>> Simone Miraglia
>> --
>> Da: Ashutosh Kumar 
>> Inviato: ‎21/‎07/‎2016 08:55
>> A: Simone ; user @spark
>> 
>> Oggetto: Re: Reading multiple json files form nested folders for data
>> frame
>>
>> That example points to a particular json file. Will it work same way if I
>> point to top level folder containing all json files ?
>>
>> On Thu, Jul 21, 2016 at 12:04 PM, Simone 
>> wrote:
>>
>>> Yes you can - have a look here
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>
>>> Hope it helps
>>>
>>> Simone Miraglia
>>> --
>>> Da: Ashutosh Kumar 
>>> Inviato: ‎21/‎07/‎2016 08:19
>>> A: user @spark 
>>> Oggetto: Reading multiple json files form nested folders for data frame
>>>
>>> I need to read bunch of json files kept in date wise folders and perform
>>> sql queries on them using data frame. Is it possible to do so? Please
>>> provide some pointers .
>>>
>>> Thanks
>>> Ashutosh
>>>
>>
>>
>


Re: Programmatic use of UDFs from Java

2016-07-21 Thread Gourav Sengupta
JAVA seriously?

On Thu, Jul 21, 2016 at 6:10 PM, Everett Anderson 
wrote:

> Hi,
>
> In the Java Spark DataFrames API, you can create a UDF, register it, and
> then access it by string name by using the convenience UDF classes in
> org.apache.spark.sql.api.java
> 
> .
>
> Example
>
> UDF1 testUdf1 = new UDF1<>() { ... }
>
> sqlContext.udf().register("testfn", testUdf1, DataTypes.LongType);
>
> DataFrame df2 = df.withColumn("new_col", functions.callUDF("testfn",
> df.col("old_col")));
>
> However, I'd like to avoid registering these by name, if possible, since I
> have many of them and would need to deal with name conflicts.
>
> There are udf() methods like this that seem to be from the Scala API,
> where you don't have to register everything by name first.
>
> However, using those methods from Java would require interacting with
> Scala's scala.reflect.api.TypeTags.TypeTag. I'm having a hard time
> figuring out how to create a TypeTag from Java.
>
> Does anyone have an example of using the udf() methods from Java?
>
> Thanks!
>
> - Everett
>
>


Re: How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Saisai Shao
I think both 6066 and 7077 will work. 6066 uses the REST way to
submit the application, while 7077 is the legacy way. From the user's perspective, it
should be transparent, with no need to worry about the difference.


   - *URL:* spark://hw12100.local:7077
   - *REST URL:* spark://hw12100.local:6066 (cluster mode)


Thanks
Saisai

On Fri, Jul 22, 2016 at 6:44 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> I have some very long lived streaming apps. They have been running for
> several months. I wonder if something has changed recently? I first started
> working with spark-1.3 . I am using the stand alone cluster manager. The
> way I would submit my app to run in cluster mode was port 6066
>
>
> Looking at the spark-1.6 it seems like we are supposed to use port 7077
> and the  new argument
>
>
> http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
>
>
> - --deploy-mode: Whether to deploy your driver on the worker nodes (cluster)
> or locally as an external client (client) (default: client)
>
>
> Can anyone confirm this. It took me a very long time to figure out how to
> get things to run cluster mode.
>
> Thanks
>
> Andy
>


SVD output within Spark

2016-07-21 Thread Martin Somers
just looking at a comparison between Matlab and Spark for SVD with an
input matrix N


this is matlab code - yes very small matrix

N =

    2.5903   -0.0416    0.6023
   -0.1236    2.5596    0.7629
    0.0148   -0.0693    0.2490



U =

   -0.3706   -0.9284    0.0273
   -0.9287    0.3708    0.0014
   -0.0114   -0.0248   -0.9996


Spark code

// Breeze to spark
val N1D = N.reshape(1, 9).toArray


// Note I had to transpose array to get correct values with incorrect signs
val V2D = N1D.grouped(3).toArray.transpose


// Then convert the array into a RDD
// val NVecdis = Vectors.dense(N1D.map(x => x.toDouble))
// val V2D = N1D.grouped(3).toArray


val rowlocal = V2D.map{x => Vectors.dense(x)}
val rows = sc.parallelize(rowlocal)
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(mat.numCols().toInt, computeU=true)



Spark Output - notice the change in sign on the 2nd and 3rd column
-0.3158590633523746   0.9220516369164243   -0.22372713505049768
-0.8822050381939436   -0.3721920780944116  -0.28842213436035985
-0.34920956843045253  0.10627246051309004  0.9309988407367168



And finally some julia code
N = [2.59031    -0.0416335   0.602295;
     -0.123584    2.55964     0.762906;
      0.0148463  -0.0693119   0.249017]

svd(N, thin=true)   --- same as matlab
-0.315859  -0.922052   0.223727
-0.882205   0.372192   0.288422
-0.34921   -0.106272  -0.930999

Most likely it's an issue with my implementation rather than a bug
with SVD within the Spark environment.
My Spark instance is running locally in a Docker container.
Any suggestions?
tks


How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Andy Davidson
I have some very long lived streaming apps. They have been running for
several months. I wonder if something has changed recently? I first started
working with spark-1.3 . I am using the stand alone cluster manager. The way
I would submit my app to run in cluster mode was port 6066


Looking at the spark-1.6 it seems like we are supposed to use port 7077 and
the  new argument 

http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
* --deploy-mode: Whether to deploy your driver on the worker nodes (cluster)
or locally as an external client (client) (default: client)

Can anyone confirm this. It took me a very long time to figure out how to
get things to run cluster mode.

Thanks

Andy




Upgrade from 1.2 to 1.6 - parsing flat files in working directory

2016-07-21 Thread Sumona Routh
Hi all,
We are running into a classpath issue when we upgrade our application from
1.2 to 1.6.

In 1.2, we load properties from a flat file (from the working directory of the
spark-submit script) using a classloader resource approach. This was executed
up front (by the driver) before any processing happened.

 val confStream =
Thread.currentThread().getContextClassLoader().getResourceAsStream(appConfigPath)

confProperties.load(confStream)

In 1.6, the line getResourceAsStream returns a null and thus causes a
subsequent NullPointerException when loading the properties.

How do we pass flat files (there are many, so we really want to add a
directory to the classpath) in Spark 1.6? We haven't had much luck with
--files and --driver-class-path and spark.driver.extraClasspath. We also
couldn't find much documentation on this.

Thanks!
Sumona


Number of sortBy output partitions

2016-07-21 Thread Simone Franzini
Hi all,

I am really struggling with the behavior of sortBy. I am running sortBy on
a fairly large dataset (~20GB), that I partitioned in 1200 tasks. The
output of the sortBy stage in the Spark UI shows that it ran with 1200
tasks.

However, when I run the next operation (e.g. filter or saveAsTextFile) I
find myself with only 7 partitions. The problem with this is that those
partitions are extremely skewed with 99.99% of the data being in a 12GB
partitions and everything else being in tiny partitions.

It appears (by writing to file) that the data is partitioned according to
the value that I used to sort on (as expected). The problem is that 99.99%
of the data has the same value and therefore ends up in the same partition.

I tried changing the number of tasks in the sortBy as well as a repartition
after the sortBy but to no avail. Is there any way of changing this
behavior? I fear not as this is probably due to the way that sortBy is
implemented, but I thought I would ask anyway.

Should it matter, I am running Spark 1.4.2 (DataStax Enterprise).

Thanks,

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini


Re: SparkWebUI and Master URL on EC2

2016-07-21 Thread Jacek Laskowski
Hi,

What's in the logs of spark-shell? There should be the host and port
of the web UI. What is the public IP of the host where you execute
spark-shell? Use it for 4040. I don't think you are using a Spark Standalone
cluster (the other address with 8080) if you simply run spark-shell
(unless you've got spark.master in conf/spark-defaults.conf).

When you're inside spark-shell, can you run sc.master? Use this for the Spark
Standalone web UI (replacing 7077 with 8080).

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jul 20, 2016 at 10:09 PM, KhajaAsmath Mohammed
 wrote:
> Hi,
>
> I got access to a Spark cluster and have instantiated spark-shell on AWS
> using the command
> $spark-shell.
>
> The Spark shell started successfully, but I am looking to access the WebUI and
> Master URL. Does anyone know how to access those on AWS?
>
> I tried http://IPMaster:4040 and http://IpMaster:8080 but they didn't show
> anything.
>
> Thanks,
> Asmath.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Thanks for the reply RK.

Using the first option, my application doesn't recognize
spark.driver.extraJavaOptions. 

With the second option, the issue remains the same:

2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing SparkContext. 
org.apache.spark.SparkException: Found both spark.executor.extraJavaOptions
and SPARK_JAVA_OPTS. Use only the former. 

Looks like one of two issues:
1. Somewhere in my cluster SPARK_JAVA_OPTS is getting set, but I have done
a detailed review of my cluster and nowhere am I exporting this value.
2. There is some issue with this specific version of CDH (CDH 5.7.1 + Spark
1.6.0)

-Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-extraJavaOptions-tp27389p27392.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: MultiThreading in Spark 1.6.0

2016-07-21 Thread RK Aduri
Thanks for the idea Maciej. The data is roughly 10 gigs.

I’m wondering if there is any way to avoid the collect for each unit operation and
somehow capture all such resultant arrays and collect them at once.
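
Something along the lines of Maciej's union-then-collect suggestion below,
sketched in Java (the u_id column and the variable names are placeholders):

// Build one DataFrame that covers every unit of work, then collect once.
DataFrame combined = null;
for (String uid : uniqueIds) {                      // uniqueIds: the U_ID subset
  DataFrame part = largeRowDataFrame                // the cached DataFrame
      .filter(largeRowDataFrame.col("u_id").equalTo(uid));
  combined = (combined == null) ? part : combined.unionAll(part);
}
Row[] everything = combined.collect();              // one job instead of one per id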

> On Jul 20, 2016, at 2:52 PM, Maciej Bryński  wrote:
> 
> RK Aduri,
> Another idea is to union all results and then run collect.
> The question is how big collected data is.
> 
> 2016-07-20 20:32 GMT+02:00 RK Aduri :
>> Spark version: 1.6.0
>> So, here is the background:
>> 
>>I have a data frame (Large_Row_DataFrame) which I have created from an
>> array of row objects and also have another array of unique ids (U_ID) which
>> I’m going to use to look up into the Large_Row_DataFrame (which is cached)
>> to do a customized function.
>>   For the each lookup for each unique id, I do a collect on the cached
>> dataframe Large_Row_DataFrame. This means that they would be a bunch of
>> ‘collect’ actions which Spark has to run. Since I’m executing this in a loop
>> for each unique id (U_ID), all the such collect actions run in sequential
>> mode.
>> 
>> Solution that I implemented:
>> 
>> To avoid the sequential wait of each collect, I have created few subsets of
>> unique ids with a specific size and run each thread for such a subset. For
>> each such subset, I executed a thread which is a spark job that runs
>> collects in sequence only for that subset. And, I have created as many
>> threads as subsets, each thread handling each subset. Surprisingly, The
>> resultant run time is better than the earlier sequential approach.
>> 
>> Now the question:
>> 
>>Is the multithreading a correct approach towards the solution? Or 
>> could
>> there be a better way of doing this.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/MultiThreading-in-Spark-1-6-0-tp27374.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> 
> 
> -- 
> Maciek Bryński


-- 
Collective[i] dramatically improves sales and marketing performance using 
technology, applications and a revolutionary network designed to provide 
next generation analytics and decision-support directly to business users. 
Our goal is to maximize human potential and minimize mistakes. In most 
cases, the results are astounding. We cannot, however, stop emails from 
sometimes being sent to the wrong person. If you are not the intended 
recipient, please notify us by replying to this email's sender and deleting 
it (and any attachments) permanently from your system. If you are, please 
respect the confidentiality of this communication's contents.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Understanding Spark UI DAGs

2016-07-21 Thread C. Josephson
Ok, so those line numbers in our DAG don't refer to our code. Is there any
way to display (or calculate) line numbers that refer to code we actually
wrote, or is that only possible in Scala Spark?

On Thu, Jul 21, 2016 at 12:24 PM, Jacek Laskowski  wrote:

> Hi,
>
> My little understanding of Python-Spark bridge is that at some point
> the python code communicates over the wire with Spark's backbone that
> includes PythonRDD [1].
>
> When the CallSite can't be computed, it's null:-1 to denote "nothing
> could be referred to".
>
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Jul 21, 2016 at 8:36 PM, C. Josephson  wrote:
> >> It's called a CallSite that shows where the line comes from. You can see
> >> the code yourself given the python file and the line number.
> >
> >
> > But that's what I don't understand. Which python file? We spark submit
> one
> > file called ctr_parsing.py, but it only has 150 lines. So what is
> > MapPartitions at PythonRDD.scala:374 referring to? ctr_parsing.py
> imports a
> > number of support functions we wrote, but how do we know which python
> file
> > to look at?
> >
> > Furthermore, what on earth is null:-1 referring to?
>



-- 
Colleen Josephson
Engineering Researcher
Uhana, Inc.


Re: Understanding Spark UI DAGs

2016-07-21 Thread RK Aduri
That -1 is coming from here:

PythonRDD.writeIteratorToStream(inputIterator, dataOut)
dataOut.writeInt(SpecialLengths.END_OF_DATA_SECTION)   —>  val END_OF_DATA_SECTION = -1
dataOut.writeInt(SpecialLengths.END_OF_STREAM)
dataOut.flush()

> On Jul 21, 2016, at 12:24 PM, Jacek Laskowski  wrote:
> 
> Hi,
> 
> My little understanding of Python-Spark bridge is that at some point
> the python code communicates over the wire with Spark's backbone that
> includes PythonRDD [1].
> 
> When the CallSite can't be computed, it's null:-1 to denote "nothing
> could be referred to".
> 
> [1] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Thu, Jul 21, 2016 at 8:36 PM, C. Josephson  wrote:
>>> It's called a CallSite that shows where the line comes from. You can see
>>> the code yourself given the python file and the line number.
>> 
>> 
>> But that's what I don't understand. Which python file? We spark submit one
>> file called ctr_parsing.py, but it only has 150 lines. So what is
>> MapPartitions at PythonRDD.scala:374 referring to? ctr_parsing.py imports a
>> number of support functions we wrote, but how do we know which python file
>> to look at?
>> 
>> Furthermore, what on earth is null:-1 referring to?
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-- 
Collective[i] dramatically improves sales and marketing performance using 
technology, applications and a revolutionary network designed to provide 
next generation analytics and decision-support directly to business users. 
Our goal is to maximize human potential and minimize mistakes. In most 
cases, the results are astounding. We cannot, however, stop emails from 
sometimes being sent to the wrong person. If you are not the intended 
recipient, please notify us by replying to this email's sender and deleting 
it (and any attachments) permanently from your system. If you are, please 
respect the confidentiality of this communication's contents.


Re: spark.driver.extraJavaOptions

2016-07-21 Thread RK Aduri
This has worked for me:
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Driver.properties" \

you may want to try it.

If that doesn't work, then you may use --properties-file



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-extraJavaOptions-tp27389p27391.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Understanding Spark UI DAGs

2016-07-21 Thread Jacek Laskowski
Hi,

My little understanding of Python-Spark bridge is that at some point
the python code communicates over the wire with Spark's backbone that
includes PythonRDD [1].

When the CallSite can't be computed, it's null:-1 to denote "nothing
could be referred to".

[1] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jul 21, 2016 at 8:36 PM, C. Josephson  wrote:
>> It's called a CallSite that shows where the line comes from. You can see
>> the code yourself given the python file and the line number.
>
>
> But that's what I don't understand. Which python file? We spark submit one
> file called ctr_parsing.py, but it only has 150 lines. So what is
> MapPartitions at PythonRDD.scala:374 referring to? ctr_parsing.py imports a
> number of support functions we wrote, but how do we know which python file
> to look at?
>
> Furthermore, what on earth is null:-1 referring to?

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark.driver.extraJavaOptions

2016-07-21 Thread dhruve ashar
I am not familiar with the CDH distributions. However from the exception,
you are setting both SPARK_JAVA_OPTS and specifying individually for driver
and executor.

Check for the spark-env.sh file in your spark config directory and you
could comment/remove  the SPARK_JAVA_OPTS entry and add the values to
required driver and executor java options.

On Thu, Jul 21, 2016 at 12:10 PM, SamyaMaiti 
wrote:

> Hi Team,
>
> I am using *CDH 5.7.1* with spark *1.6.0*
>
> I have a Spark Streaming application that reads from Kafka & does some
> processing.
>
> The issue is that while starting the application in CLUSTER mode, I want to pass
> a custom log4j.properties file to both driver & executor.
>
> *I have the below command :-*
>
> spark-submit \
> --class xyx.search.spark.Boot \
> --conf "spark.cores.max=6" \
> --conf "spark.eventLog.enabled=true" \
> *--conf
>
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Driver.properties"
> \
> --conf
>
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Executor.properties"
> \*
> --deploy-mode "cluster" \
> /some/path/search-spark-service-1.0.0.jar \
> /some/path/conf/
>
>
> *But it gives the below exception :-*
>
> SPARK_JAVA_OPTS was detected (set to
> '-XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh ').
> This is deprecated in Spark 1.0+.
>
> Please instead use:
>  - ./spark-submit with conf/spark-defaults.conf to set defaults for an
> application
>  - ./spark-submit with --driver-java-options to set -X options for a driver
>  - spark.executor.extraJavaOptions to set -X options for executors
>  - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons
> (master
> or worker)
>
> 2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing
> SparkContext.
> org.apache.spark.SparkException: Found both spark.executor.extraJavaOptions
> and SPARK_JAVA_OPTS. Use only the former.
> at
>
> org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$5.apply(SparkConf.scala:470)
> at
>
> org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$5.apply(SparkConf.scala:468)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
>
> org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:468)
> at
>
> org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:454)
>
>
> /*Please note the same works with CDH 5.4 with spark 1.3.0.*/
>
> Regards,
> Sam
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-extraJavaOptions-tp27389.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
-Dhruve Ashar


Re: Understanding Spark UI DAGs

2016-07-21 Thread C. Josephson
>
> It's called a CallSite that shows where the line comes from. You can see
> the code yourself given the python file and the line number.
>

But that's what I don't understand. Which python file? We spark submit one
file called ctr_parsing.py, but it only has 150 lines. So what is
MapPartitions at PythonRDD.scala:374 referring to? ctr_parsing.py imports a
number of support functions we wrote, but how do we know which python file
to look at?

Furthermore, what on earth is null:-1 referring to?


how to resolve you must build spark with hive exception?

2016-07-21 Thread Nomii5007
Hello, I know this question has already been asked, but no one answered it, which is
why I am asking again.
I am using the Anaconda 3.5 distribution and Spark 1.6.2.
I have been following this blog
. It was running fine until I reached the 7th cell, here:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction

binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())

CV_data = CV_data.drop('State').drop('Area code') \
.drop('Total day charge').drop('Total eve charge') \
.drop('Total night charge').drop('Total intl charge') \
.withColumn('Churn', toNum(CV_data['Churn'])) \
.withColumn('International plan', toNum(CV_data['International plan']))
\
.withColumn('Voice mail plan', toNum(CV_data['Voice mail
plan'])).cache()

final_test_data = final_test_data.drop('State').drop('Area code') \
.drop('Total day charge').drop('Total eve charge') \
.drop('Total night charge').drop('Total intl charge') \
.withColumn('Churn', toNum(final_test_data['Churn'])) \
.withColumn('International plan', toNum(final_test_data['International
plan'])) \
.withColumn('Voice mail plan', toNum(final_test_data['Voice mail
plan'])).cache()


here i am getting exception 

You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt
assembly
---
Py4JJavaError Traceback (most recent call last)
 in ()
  3 
  4 binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}
> 5 toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
  6 
  7 CV_data = CV_data.drop('State').drop('Area code') .drop('Total
day charge').drop('Total eve charge') .drop('Total night
charge').drop('Total intl charge') .withColumn('Churn',
toNum(CV_data['Churn'])) .withColumn('International plan',
toNum(CV_data['International plan'])) .withColumn('Voice mail plan',
toNum(CV_data['Voice mail plan'])).cache()

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\functions.py
in __init__(self, func, returnType, name)
   1556 self.returnType = returnType
   1557 self._broadcast = None
-> 1558 self._judf = self._create_judf(name)
   1559 
   1560 def _create_judf(self, name):

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\functions.py
in _create_judf(self, name)
   1567 pickled_command, broadcast_vars, env, includes =
_prepare_for_python_RDD(sc, command, self)
   1568 ctx = SQLContext.getOrCreate(sc)
-> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
   1570 if name is None:
   1571 name = f.__name__ if hasattr(f, '__name__') else
f.__class__.__name__

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\context.py in
_ssql_ctx(self)
681 try:
682 if not hasattr(self, '_scala_HiveContext'):
--> 683 self._scala_HiveContext = self._get_hive_ctx()
684 return self._scala_HiveContext
685 except Py4JError as e:

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\context.py in
_get_hive_ctx(self)
690 
691 def _get_hive_ctx(self):
--> 692 return self._jvm.HiveContext(self._jsc.sc())
693 
694 def refreshTable(self, tableName):

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py
in __call__(self, *args)
   1062 answer = self._gateway_client.send_command(command)
   1063 return_value = get_return_value(
-> 1064 answer, self._gateway_client, None, self._fqn)
   1065 
   1066 for temp_arg in temp_args:

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\utils.py in
deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\lib\py4j-0.9-src.zip\py4j\protocol.py
in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.NullPointerException
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createC

spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Hi Team,

I am using *CDH 5.7.1* with spark *1.6.0*

I have a Spark Streaming application that reads from Kafka & does some
processing.

The issue is that while starting the application in CLUSTER mode, I want to pass
a custom log4j.properties file to both driver & executor.

*I have the below command :-*

spark-submit \
--class xyx.search.spark.Boot \
--conf "spark.cores.max=6" \
--conf "spark.eventLog.enabled=true" \
--conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Driver.properties"
\
--conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Executor.properties"
\
--deploy-mode "cluster" \
/some/path/search-spark-service-1.0.0.jar \
/some/path/conf/


*But it gives the below exception :-*

SPARK_JAVA_OPTS was detected (set to
'-XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh ').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an
application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master
or worker)

2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing SparkContext.
org.apache.spark.SparkException: Found both spark.executor.extraJavaOptions
and SPARK_JAVA_OPTS. Use only the former.
at
org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$5.apply(SparkConf.scala:470)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$5.apply(SparkConf.scala:468)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:468)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:454)


/*Please note the same works with CDH 5.4 with spark 1.3.0.*/

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-extraJavaOptions-tp27389.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Programmatic use of UDFs from Java

2016-07-21 Thread Everett Anderson
Hi,

In the Java Spark DataFrames API, you can create a UDF, register it, and
then access it by string name by using the convenience UDF classes in
org.apache.spark.sql.api.java

.

Example

UDF1 testUdf1 = new UDF1<>() { ... }

sqlContext.udf().register("testfn", testUdf1, DataTypes.LongType);

DataFrame df2 = df.withColumn("new_col", functions.callUDF("testfn",
df.col("old_col")));

However, I'd like to avoid registering these by name, if possible, since I
have many of them and would need to deal with name conflicts.

There are udf() methods like this that seem to be from the Scala API,
where you don't have to register everything by name first.

However, using those methods from Java would require interacting with
Scala's scala.reflect.api.TypeTags.TypeTag. I'm having a hard time figuring
out how to create a TypeTag from Java.

Does anyone have an example of using the udf() methods from Java?

Thanks!

- Everett


Re: spark and plot data

2016-07-21 Thread Andy Davidson
Hi Pseudo

Plotting, graphing, data visualization, report generation are common needs
in scientific and enterprise computing.

Can you tell me more about your use case? What is it about the current
process / workflow do you think could be improved by pushing plotting (I
assume you mean plotting and graphing) into spark.


In my personal work all the graphing is done in the driver on summary stats
calculated using spark. So for me using standard python libs has not been a
problem.

Andy

From:  pseudo oduesp 
Date:  Thursday, July 21, 2016 at 8:30 AM
To:  "user @spark" 
Subject:  spark and plot data

> Hi,
> I know Spark is an engine to compute large data sets, but I work with
> pyspark and it is a very wonderful machine.
>
> My question: we don't have tools for plotting data, so each time we have to switch
> and go back to Python to use plot.
> But when you have a large result, a scatter plot or ROC curve, you can't use collect
> to take the data.
>
> Does someone have a proposal for plotting?
>
> thanks




add hours to from_unixtimestamp

2016-07-21 Thread Divya Gehlot
Hi,
I need to add 8  hours to from_unixtimestamp
df.withColumn(from_unixtime(col("unix_timestamp"),fmt)) as "date_time"

I am trying a joda-time function:
def unixToDateTime (unix_timestamp : String) : DateTime = {
 val utcTS = new DateTime(unix_timestamp.toLong * 1000L)+ 8.hours
  return utcTS
}

It's throwing an error: java.lang.UnsupportedOperationException: Schema for
type com.github.nscala_time.time.Imports.DateTime is not supported
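
For what it's worth, one route that avoids a custom return type entirely -- a
sketch, assuming unix_timestamp holds seconds -- is to shift the epoch value
before formatting, e.g. in Java:

// 8 hours = 8 * 3600 seconds added to the epoch value before formatting.
DataFrame withDate = df.withColumn("date_time",
    functions.from_unixtime(df.col("unix_timestamp").plus(8 * 3600), fmt));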



Would really appreciate the help.


Thanks,
Divya


RE: Role-based S3 access outside of EMR

2016-07-21 Thread Ewan Leith
If you use S3A rather than S3N, it supports IAM roles.

I think you can make s3a be used for s3:// style URLs so it’s consistent with your 
EMR paths by adding this to your Hadoop config, probably in core-site.xml:

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A

And make sure the s3a jars are in your classpath
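
If editing core-site.xml is awkward on the sandbox instances, the same
mappings can presumably also be set at runtime on the context's Hadoop
configuration -- an untested Java sketch, where jsc is your JavaSparkContext:

org.apache.hadoop.conf.Configuration hadoopConf = jsc.hadoopConfiguration();
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
hadoopConf.set("fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A");
hadoopConf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A");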

Thanks,
Ewan

From: Everett Anderson [mailto:ever...@nuna.com.INVALID]
Sent: 21 July 2016 17:01
To: Gourav Sengupta 
Cc: Teng Qiu ; Andy Davidson 
; user 
Subject: Re: Role-based S3 access outside of EMR

Hey,

FWIW, we are using EMR, actually, in production.

The main case I have for wanting to access S3 with Spark outside of EMR is that 
during development, our developers tend to run EC2 sandbox instances that have 
all the rest of our code and access to some of the input data on S3. It'd be 
nice if S3 access "just worked" on these without storing the access keys in an 
exposed manner.

Teng -- when you say you use EMRFS, does that mean you copied AWS's EMRFS JAR 
from an EMR cluster and are using it outside? My impression is that AWS hasn't 
released the EMRFS implementation as part of the aws-java-sdk, so I'm wary of 
using it. Do you know if it's supported?


On Thu, Jul 21, 2016 at 2:32 AM, Gourav Sengupta  wrote:
Hi Teng,
This is totally news to me that people cannot use EMR in
production because it's not open sourced; I think that even Werner is not aware
of such a problem. Is EMRFS open sourced? I am curious to know what HA
stands for.
Regards,
Gourav

On Thu, Jul 21, 2016 at 8:37 AM, Teng Qiu  wrote:
there are several reasons that AWS users do not (or cannot) use EMR. One
point for us is a security compliance problem: EMR is not
open sourced, so we cannot use it in a production system. Second is that
EMR does not support HA yet.

but to the original question from @Everett :

-> Credentials and Hadoop Configuration

as you said, best practice should be to "rely on machine roles", which are
called IAM roles.

we are using EMRFS impl for accessing s3, it supports IAM role-based
access control well. you can take a look here:
https://github.com/zalando/spark/tree/branch-1.6-zalando

or simply use our docker image (Dockerfile on github:
https://github.com/zalando/spark-appliance/tree/master/Dockerfile)

docker run -d --net=host \
   -e START_MASTER="true" \
   -e START_WORKER="true" \
   -e START_WEBAPP="true" \
   -e START_NOTEBOOK="true" \
   registry.opensource.zalan.do/bi/spark:1.6.2-6


-> SDK and File System Dependencies

as mentioned above, using EMRFS libs solved this problem:
http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-fs.html


2016-07-21 8:37 GMT+02:00 Gourav Sengupta :
> But that would mean you would be accessing data over internet increasing
> data read latency, data transmission failures. Why are you not using EMR?
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson  wrote:
>>
>> Thanks, Andy.
>>
>> I am indeed often doing something similar, now -- copying data locally
>> rather than dealing with the S3 impl selection and AWS credentials issues.
>> It'd be nice if it worked a little easier out of the box, though!
>>
>>
>> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson  wrote:
>>>
>>> Hi Everett
>>>
>>> I always do my initial data exploration and all our product development
>>> in my local dev env. I typically select a small data set and copy it to my
>>> local machine
>>>
>>> My main() has an optional command line argument ‘- - runLocal’ Normally I
>>> load data from either hdfs:/// or S3n:// . If the arg is set I read from
>>> file:///
>>>
>>> Sometime I use a CLI arg ‘- -dataFileURL’
>>>
>>> So in your case I would log into my data cluster and use “AWS s3 cp" to
>>> copy the data into my cluster and then use “SCP” to copy the data from the
>>> data center back to my local env.
>>>
>>> Andy
>>>
>>> From: Everett Anderson 
>>> Date: Tuesday, July 19, 2016 at 2:30 PM
>>> To: "user @spark" 
>>> Subject: Role-based S3 access outside of EMR
>>>
>>> Hi,
>>>
>>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
>>> FileSystem implementation for s3:// URLs and seems to install the necessary
>>> S3 credentials properties, as well.
>>>
>>> Often, it's nice during development to run outside of a cluster even with
>>> the "local" Spark master, though, which I've found to be more troublesome.
>>> I'm curious if I'm doing this the right way.

Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Todd Nist
This is due to a change in 1.6: by default the Thrift server now runs in
multi-session mode, so temp tables registered from your application's session
are not visible to a Beeline session. You would want to set the following to
true in your Spark config.

In spark-defaults.conf: spark.sql.hive.thriftServer.singleSession true
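For example, a minimal sketch of setting it programmatically before starting the
server (assuming Spark 1.6 and startWithContext as in your code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val conf = new SparkConf()
  .setAppName("spark-sql-dataexample")
  // share a single session so temp tables registered below are visible in Beeline
  .set("spark.sql.hive.thriftServer.singleSession", "true")

val hiveSqlContext = new HiveContext(SparkContext.getOrCreate(conf))
// ... register your temp tables here, as in your snippet ...
HiveThriftServer2.startWithContext(hiveSqlContext)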

Good write up here:
https://community.hortonworks.com/questions/29090/i-cant-find-my-tables-in-spark-sql-using-beeline.html

HTH.

-Todd

On Thu, Jul 21, 2016 at 10:30 AM, Marco Colombo  wrote:

> Thanks.
>
> That is just a typo. I'm using on 'spark://10.0.2.15:7077' (standalone).
> Same url used in --master in spark-submit
>
>
>
> 2016-07-21 16:08 GMT+02:00 Mich Talebzadeh :
>
>> Hi Marco
>>
>> In your code
>>
>> val conf = new SparkConf()
>>   .setMaster("spark://10.0.2.15:7077")
>>   .setMaster("local")
>>   .set("spark.cassandra.connection.host", "10.0.2.15")
>>   .setAppName("spark-sql-dataexample");
>>
>> As I understand the first .setMaster("spark://:7077 indicates
>> that you are using Spark in standalone mode and then .setMaster("local")
>> means you are using it in Local mode?
>>
>> Any reason for it?
>>
>> Basically you are overriding standalone with local.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 21 July 2016 at 14:55, Marco Colombo 
>> wrote:
>>
>>> Hi all, I have a spark application that was working in 1.5.2, but now
>>> has a problem in 1.6.2.
>>>
>>> Here is an example:
>>>
>>> val conf = new SparkConf()
>>>   .setMaster("spark://10.0.2.15:7077")
>>>   .setMaster("local")
>>>   .set("spark.cassandra.connection.host", "10.0.2.15")
>>>   .setAppName("spark-sql-dataexample");
>>>
>>> val hiveSqlContext = new HiveContext(SparkContext.getOrCreate(conf));
>>>
>>> //Registering tables
>>> var query = """OBJ_TAB""".stripMargin;
>>>
>>> val options = Map(
>>>   "driver" -> "org.postgresql.Driver",
>>>   "url" -> "jdbc:postgresql://127.0.0.1:5432/DB",
>>>   "user" -> "postgres",
>>>   "password" -> "postgres",
>>>   "dbtable" -> query);
>>>
>>> import hiveSqlContext.implicits._;
>>> val df: DataFrame =
>>> hiveSqlContext.read.format("jdbc").options(options).load();
>>> df.registerTempTable("V_OBJECTS");
>>>
>>>  val optionsC = Map("table"->"data_tab", "keyspace"->"data");
>>> val stats : DataFrame =
>>> hiveSqlContext.read.format("org.apache.spark.sql.cassandra").options(optionsC).load();
>>> //stats.foreach { x => println(x) }
>>> stats.registerTempTable("V_DATA");
>>>
>>> //START HIVE SERVER
>>> HiveThriftServer2.startWithContext(hiveSqlContext);
>>>
>>> Now, from app I can perform queries and joins over the 2 registered
>>> table, but if I connect to port 10000 via beeline, I see no registered
>>> tables.
>>> show tables is empty.
>>>
>>> I'm using embedded DERBY DB, but this was working in 1.5.2.
>>>
>>> Any suggestion?
>>>
>>> Thanks
>>>
>>>
>>
>
>
> --
> Ing. Marco Colombo
>


Re: Role-based S3 access outside of EMR

2016-07-21 Thread Everett Anderson
Hey,

FWIW, we are using EMR, actually, in production.

The main case I have for wanting to access S3 with Spark outside of EMR is
that during development, our developers tend to run EC2 sandbox instances
that have all the rest of our code and access to some of the input data on
S3. It'd be nice if S3 access "just worked" on these without storing the
access keys in an exposed manner.

Teng -- when you say you use EMRFS, does that mean you copied AWS's EMRFS
JAR from an EMR cluster and are using it outside? My impression is that AWS
hasn't released the EMRFS implementation as part of the aws-java-sdk, so
I'm wary of using it. Do you know if it's supported?


On Thu, Jul 21, 2016 at 2:32 AM, Gourav Sengupta 
wrote:

> Hi Teng,
>
> This is totally a flashing news for me, that people cannot use EMR in
> production because its not open sourced, I think that even Werner is not
> aware of such a problem. Is EMRFS opensourced? I am curious to know what
> does HA stand for?
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 8:37 AM, Teng Qiu  wrote:
>
>> there are several reasons that AWS users do (can) not use EMR, one
>> point for us is that security compliance problem, EMR is totally not
>> open sourced, we can not use it in production system. second is that
>> EMR do not support HA yet.
>>
>> but to the original question from @Everett :
>>
>> -> Credentials and Hadoop Configuration
>>
>> as you said, best practice should be "rely on machine roles", they
>> called IAM roles.
>>
>> we are using EMRFS impl for accessing s3, it supports IAM role-based
>> access control well. you can take a look here:
>> https://github.com/zalando/spark/tree/branch-1.6-zalando
>>
>> or simply use our docker image (Dockerfile on github:
>> https://github.com/zalando/spark-appliance/tree/master/Dockerfile)
>>
>> docker run -d --net=host \
>>-e START_MASTER="true" \
>>-e START_WORKER="true" \
>>-e START_WEBAPP="true" \
>>-e START_NOTEBOOK="true" \
>>registry.opensource.zalan.do/bi/spark:1.6.2-6
>>
>>
>> -> SDK and File System Dependencies
>>
>> as mentioned above, using EMRFS libs solved this problem:
>>
>> http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-fs.html
>>
>>
>> 2016-07-21 8:37 GMT+02:00 Gourav Sengupta :
>> > But that would mean you would be accessing data over internet increasing
>> > data read latency, data transmission failures. Why are you not using
>> EMR?
>> >
>> > Regards,
>> > Gourav
>> >
>> > On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson
>> 
>> > wrote:
>> >>
>> >> Thanks, Andy.
>> >>
>> >> I am indeed often doing something similar, now -- copying data locally
>> >> rather than dealing with the S3 impl selection and AWS credentials
>> issues.
>> >> It'd be nice if it worked a little easier out of the box, though!
>> >>
>> >>
>> >> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson
>> >>  wrote:
>> >>>
>> >>> Hi Everett
>> >>>
>> >>> I always do my initial data exploration and all our product
>> development
>> >>> in my local dev env. I typically select a small data set and copy it
>> to my
>> >>> local machine
>> >>>
>> >>> My main() has an optional command line argument ‘- - runLocal’
>> Normally I
>> >>> load data from either hdfs:/// or S3n:// . If the arg is set I read
>> from
>> >>> file:///
>> >>>
>> >>> Sometime I use a CLI arg ‘- -dataFileURL’
>> >>>
>> >>> So in your case I would log into my data cluster and use “AWS s3 cp"
>> to
>> >>> copy the data into my cluster and then use “SCP” to copy the data
>> from the
>> >>> data center back to my local env.
>> >>>
>> >>> Andy
>> >>>
>> >>> From: Everett Anderson 
>> >>> Date: Tuesday, July 19, 2016 at 2:30 PM
>> >>> To: "user @spark" 
>> >>> Subject: Role-based S3 access outside of EMR
>> >>>
>> >>> Hi,
>> >>>
>> >>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
>> >>> FileSystem implementation for s3:// URLs and seems to install the
>> necessary
>> >>> S3 credentials properties, as well.
>> >>>
>> >>> Often, it's nice during development to run outside of a cluster even
>> with
>> >>> the "local" Spark master, though, which I've found to be more
>> troublesome.
>> >>> I'm curious if I'm doing this the right way.
>> >>>
>> >>> There are two issues -- AWS credentials and finding the right
>> combination
>> >>> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
>> >>>
>> >>> Credentials and Hadoop Configuration
>> >>>
>> >>> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY
>> and
>> >>> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
>> >>> properties in Hadoop XML config files, but it seems better practice
>> to rely
>> >>> on machine roles and not expose these.
>> >>>
>> >>> What I end up doing is, in code, when not running on EMR, creating a
>> >>> DefaultAWSCredentialsProviderChain and then installing the following
>> >>> properties in the Hadoop Configuration using it:
>> >>>
>> >>> fs.s3.awsAccessKeyId
>> >>

Re: Load selected rows with sqlContext in the dataframe

2016-07-21 Thread Todd Nist
You can set the dbtable to this:

.option("dbtable", "(select * from master_schema where 'TID' = '100_0')")

HTH,

Todd


On Thu, Jul 21, 2016 at 10:59 AM, sujeet jog  wrote:

> I have a table of size 5GB, and want to load selective rows into dataframe
> instead of loading the entire table in memory,
>
>
> For me memory is a constraint hence , and i would like to peridically load
> few set of rows and perform dataframe operations on it,
>
> ,
> for the "dbtable"  is there a way to perform select * from master_schema
> where 'TID' = '100_0';
> which can load only this to memory as dataframe .
>
>
>
> Currently  I'm using code as below
> val df  =  sqlContext.read .format("jdbc")
>   .option("url", url)
>   .option("dbtable", "master_schema").load()
>
>
> Thansk,
> Sujeet
>


spark and plot data

2016-07-21 Thread pseudo oduesp
Hi,
I know Spark is an engine for computing over large data sets, but I work with
PySpark and it is a very wonderful machine.

My question: we don't have tools for plotting data, so each time we have to
switch and go back to plain Python for plotting. But when you have a large
result (a scatter plot or a ROC curve) you can't use collect to pull the data
back.

Does someone have a proposal for plotting?

thanks


Upgrading a Hive External Storage Handler...

2016-07-21 Thread Lavelle, Shawn
Hello,

   I am looking to upgrade a Hive 0.11 external storage handler that was run on 
Shark 0.9.2 to work on spark-sql 1.6.1.  I’ve run into a snag in that it seems 
that the storage handler is not receiving predicate pushdown information.

   Being fairly new to Spark’s development, would someone please tell me A) Can 
I still use external storage handlers in HIVE or am I forced to use the new 
DataFrame API and B) Did I miss something in how Pushdown Predicates work or 
are accessed due to the upgrade from hive 0.11 to hive 1.2.1?

   Thank you,

~ Shawn M Lavelle




Shawn Lavelle
Software Development

4101 Arrowhead Drive
Medina, Minnesota 55340-9457
Phone: 763 551 0559
Fax: 763 551 0750
Email: shawn.lave...@osii.com
Website: www.osii.com



Load selected rows with sqlContext in the dataframe

2016-07-21 Thread sujeet jog
I have a table of size 5GB, and want to load selective rows into dataframe
instead of loading the entire table in memory,


For me memory is a constraint hence , and i would like to peridically load
few set of rows and perform dataframe operations on it,

,
for the "dbtable"  is there a way to perform select * from master_schema
where 'TID' = '100_0';
which can load only this to memory as dataframe .



Currently  I'm using code as below
val df  =  sqlContext.read .format("jdbc")
  .option("url", url)
  .option("dbtable", "master_schema").load()


Thansk,
Sujeet


Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Marco Colombo
Thanks.

That is just a typo. I'm running on 'spark://10.0.2.15:7077' (standalone).
It is the same URL used in --master with spark-submit.



2016-07-21 16:08 GMT+02:00 Mich Talebzadeh :

> Hi Marco
>
> In your code
>
> val conf = new SparkConf()
>   .setMaster("spark://10.0.2.15:7077")
>   .setMaster("local")
>   .set("spark.cassandra.connection.host", "10.0.2.15")
>   .setAppName("spark-sql-dataexample");
>
> As I understand the first .setMaster("spark://:7077 indicates
> that you are using Spark in standalone mode and then .setMaster("local")
> means you are using it in Local mode?
>
> Any reason for it?
>
> Basically you are overriding standalone with local.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 July 2016 at 14:55, Marco Colombo 
> wrote:
>
>> Hi all, I have a spark application that was working in 1.5.2, but now has
>> a problem in 1.6.2.
>>
>> Here is an example:
>>
>> val conf = new SparkConf()
>>   .setMaster("spark://10.0.2.15:7077")
>>   .setMaster("local")
>>   .set("spark.cassandra.connection.host", "10.0.2.15")
>>   .setAppName("spark-sql-dataexample");
>>
>> val hiveSqlContext = new HiveContext(SparkContext.getOrCreate(conf));
>>
>> //Registering tables
>> var query = """OBJ_TAB""".stripMargin;
>>
>> val options = Map(
>>   "driver" -> "org.postgresql.Driver",
>>   "url" -> "jdbc:postgresql://127.0.0.1:5432/DB",
>>   "user" -> "postgres",
>>   "password" -> "postgres",
>>   "dbtable" -> query);
>>
>> import hiveSqlContext.implicits._;
>> val df: DataFrame =
>> hiveSqlContext.read.format("jdbc").options(options).load();
>> df.registerTempTable("V_OBJECTS");
>>
>>  val optionsC = Map("table"->"data_tab", "keyspace"->"data");
>> val stats : DataFrame =
>> hiveSqlContext.read.format("org.apache.spark.sql.cassandra").options(optionsC).load();
>> //stats.foreach { x => println(x) }
>> stats.registerTempTable("V_DATA");
>>
>> //START HIVE SERVER
>> HiveThriftServer2.startWithContext(hiveSqlContext);
>>
>> Now, from app I can perform queries and joins over the 2 registered
>> table, but if I connect to port 10000 via beeline, I see no registered
>> tables.
>> show tables is empty.
>>
>> I'm using embedded DERBY DB, but this was working in 1.5.2.
>>
>> Any suggestion?
>>
>> Thanks
>>
>>
>


-- 
Ing. Marco Colombo


Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Mich Talebzadeh
Hi Marco

In your code

val conf = new SparkConf()
  .setMaster("spark://10.0.2.15:7077")
  .setMaster("local")
  .set("spark.cassandra.connection.host", "10.0.2.15")
  .setAppName("spark-sql-dataexample");

As I understand the first .setMaster("spark://:7077 indicates
that you are using Spark in standalone mode and then .setMaster("local")
means you are using it in Local mode?

Any reason for it?

Basically you are overriding standalone with local.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 July 2016 at 14:55, Marco Colombo  wrote:

> Hi all, I have a spark application that was working in 1.5.2, but now has
> a problem in 1.6.2.
>
> Here is an example:
>
> val conf = new SparkConf()
>   .setMaster("spark://10.0.2.15:7077")
>   .setMaster("local")
>   .set("spark.cassandra.connection.host", "10.0.2.15")
>   .setAppName("spark-sql-dataexample");
>
> val hiveSqlContext = new HiveContext(SparkContext.getOrCreate(conf));
>
> //Registering tables
> var query = """OBJ_TAB""".stripMargin;
>
> val options = Map(
>   "driver" -> "org.postgresql.Driver",
>   "url" -> "jdbc:postgresql://127.0.0.1:5432/DB",
>   "user" -> "postgres",
>   "password" -> "postgres",
>   "dbtable" -> query);
>
> import hiveSqlContext.implicits._;
> val df: DataFrame =
> hiveSqlContext.read.format("jdbc").options(options).load();
> df.registerTempTable("V_OBJECTS");
>
>  val optionsC = Map("table"->"data_tab", "keyspace"->"data");
> val stats : DataFrame =
> hiveSqlContext.read.format("org.apache.spark.sql.cassandra").options(optionsC).load();
> //stats.foreach { x => println(x) }
> stats.registerTempTable("V_DATA");
>
> //START HIVE SERVER
> HiveThriftServer2.startWithContext(hiveSqlContext);
>
> Now, from app I can perform queries and joins over the 2 registered table,
> but if I connect to port 10000 via beeline, I see no registered tables.
> show tables is empty.
>
> I'm using embedded DERBY DB, but this was working in 1.5.2.
>
> Any suggestion?
>
> Thanks
>
>


init() and cleanup() for Spark map functions

2016-07-21 Thread Amit Sela
I have a use case where I use Spark (streaming) as a way to distribute a
set of computations, which requires (some) of the computations to call an
external service.
Naturally, I'd like to manage my connections (per executor/worker).

I know this pattern for DStream:
https://people.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/spark-2.0.1-SNAPSHOT-2016_07_21_04_05-f9367d6-docs/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
and
I was wondering how I'd do the same for map functions, as I would like to
"commit" the output iterator and only afterwards "return" my connection.
And generally, how is this going to work with Structured Streaming?
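(For context, a rough sketch of the kind of thing I mean for a plain
transformation. ConnectionPool and callService here are placeholders for my own
code, not Spark APIs, and note the iterator has to be materialized before the
connection can be returned, which is exactly the awkward part:)

val resultStream = inputStream.mapPartitions { iter =>
  val conn = ConnectionPool.getConnection()
  // materialize so every record has gone through the connection...
  val out = iter.map(record => callService(conn, record)).toList
  ConnectionPool.returnConnection(conn)
  // ...before the iterator is handed back downstream
  out.iterator
}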

Thanks,
Amit


HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Marco Colombo
Hi all, I have a spark application that was working in 1.5.2, but now has a
problem in 1.6.2.

Here is an example:

val conf = new SparkConf()
  .setMaster("spark://10.0.2.15:7077")
  .setMaster("local")
  .set("spark.cassandra.connection.host", "10.0.2.15")
  .setAppName("spark-sql-dataexample");

val hiveSqlContext = new HiveContext(SparkContext.getOrCreate(conf));

//Registering tables
var query = """OBJ_TAB""".stripMargin;

val options = Map(
  "driver" -> "org.postgresql.Driver",
  "url" -> "jdbc:postgresql://127.0.0.1:5432/DB",
  "user" -> "postgres",
  "password" -> "postgres",
  "dbtable" -> query);

import hiveSqlContext.implicits._;
val df: DataFrame =
hiveSqlContext.read.format("jdbc").options(options).load();
df.registerTempTable("V_OBJECTS");

 val optionsC = Map("table"->"data_tab", "keyspace"->"data");
val stats : DataFrame =
hiveSqlContext.read.format("org.apache.spark.sql.cassandra").options(optionsC).load();
//stats.foreach { x => println(x) }
stats.registerTempTable("V_DATA");

//START HIVE SERVER
HiveThriftServer2.startWithContext(hiveSqlContext);

Now, from the app I can perform queries and joins over the 2 registered tables,
but if I connect to port 10000 via beeline, I see no registered tables.
show tables is empty.

I'm using embedded DERBY DB, but this was working in 1.5.2.

Any suggestion?

Thanks


Re: ML PipelineModel to be scored locally

2016-07-21 Thread Robin East
MLeap is another option (Apache licensed) https://github.com/TrueCar/mleap


---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 21 Jul 2016, at 06:47, Simone  wrote:
> 
> Thanks for your reply. 
> 
> I cannot rely on jpmml due licensing stuff.
> I can evaluate writing my own prediction code, but I am looking for a more 
> general purpose approach. 
> 
> Any other thoughts?
> Best
> Simone
> From: Peyman Mohajerian
> Sent: 20/07/2016 21:55
> To: Simone Miraglia
> Cc: User
> Subject: Re: ML PipelineModel to be scored locally
> 
> One option is to save the model in parquet or json format and then build your 
> own prediction code. Some also use: 
> https://github.com/jpmml/jpmml-sparkml 
> 
> It depends on the model, e.g. ml v mllib and other factors whether this works 
> on or not. Couple of weeks ago there was a long discussion on this topic.
> 
> 
> On Wed, Jul 20, 2016 at 7:08 AM, Simone Miraglia  > wrote:
> Hi all,
> 
> I am working on the following use case involving ML Pipelines.
> 
> 1. I created a Pipeline composed from a set of stages
> 2. I called "fit" method on my training set
> 3. I validated my model by calling "transform" on my test set
> 4. I stored my fitted Pipeline to a shared folder
> 
> Then I have a very low latency interactive application (say a kinda of web 
> service), that should work as follows:
> 1. The app receives a request
> 2. A scoring needs to be made, according to my fitted PipelineModel
> 3. The app sends the score to the caller, in a synchronous fashion
> 
> Is there a way to call the .transform method of the PipelineModel over a 
> single Row?
> 
> I will definitely not want to parallelize a single record to a DataFrame, nor 
> relying on Spark Streaming due to latency requirements.
> I would like to use something similar to mllib .predict(Vector) method which 
> does not rely on Spark Context performing all the computation locally.
> 
> Thanks in advance
> Best
> 



Using RDD.checkpoint to recover app failure

2016-07-21 Thread harelglik
I am writing a Spark application that has many iterations.
I am planning to checkpoint on every Nth iteration to cut the graph of my
rdd and clear previous shuffle files.
I would also like to be able to restart my application completely using the
last checkpoint.

I understand that regular checkpoint will work inside the same app, but how
can I read the checkpointed rdd in case I launch the new app?

In Spark streaming there seems to be support for recreating the full context
from a checkpoint, but I can't figure out how to do it for non-streaming
Spark.

Many thanks,
Harel.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-RDD-checkpoint-to-recover-app-failure-tp27383.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Reading multiple json files form nested folders for data frame

2016-07-21 Thread Ashutosh Kumar
It works. Is it better to use Hive in this case for better performance?
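For reference, a minimal sketch of the wildcard approach Simone describes below
(the path is just an example, with one sub-folder per date):

val df = sqlContext.read.json("/data/events/*/*.json")
df.registerTempTable("events")
sqlContext.sql("select count(*) from events").show()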

On Thu, Jul 21, 2016 at 12:30 PM, Simone  wrote:

> If you have a folder, and a bunch of json inside that folder- yes it
> should work. Just set as path something like "path/to/your/folder/*.json"
> All files will be loaded into a dataframe and schema will be the union of
> all the different schemas of your json files (only if you have different
> schemas)
> It should work - let me know
>
> Simone Miraglia
> --
> From: Ashutosh Kumar
> Sent: 21/07/2016 08:55
> To: Simone; user @spark
> Subject: Re: Reading multiple json files form nested folders for data
> frame
>
> That example points to a particular json file. Will it work same way if I
> point to top level folder containing all json files ?
>
> On Thu, Jul 21, 2016 at 12:04 PM, Simone 
> wrote:
>
>> Yes you can - have a look here
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>
>> Hope it helps
>>
>> Simone Miraglia
>> --
>> From: Ashutosh Kumar
>> Sent: 21/07/2016 08:19
>> To: user @spark
>> Subject: Reading multiple json files form nested folders for data frame
>>
>> I need to read bunch of json files kept in date wise folders and perform
>> sql queries on them using data frame. Is it possible to do so? Please
>> provide some pointers .
>>
>> Thanks
>> Ashutosh
>>
>
>


RE: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-21 Thread Joaquin Alzola
Do you have the same as link 1 but in English?

  *   
spark-questions-concepts
  *   deep-into-spark-exection-model 

They seem like really interesting posts, but they are in Chinese. I suppose Google
Translate does a poor job on the translation.


From: Taotao.Li [mailto:charles.up...@gmail.com]
Sent: 21 July 2016 04:04
To: Jean Georges Perrin 
Cc: Sachin Mittal ; user 
Subject: Re: Understanding spark concepts cluster, master, slave, job, stage, 
worker, executor, task

Hi, Sachin,  here are two posts about the basic concepts about spark:


  *   
spark-questions-concepts
  *   deep-into-spark-exection-model 


And, I fully recommend databrick's post: 
https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html


On Thu, Jul 21, 2016 at 1:36 AM, Jean Georges Perrin <j...@jgp.net> wrote:
Hey,

I love when questions are numbered, it's easier :)

1) Yes (but I am not an expert)
2) You don't control... One of my process is going to 8k tasks, so...
3) Yes, if you have HT, it double. My servers have 12 cores, but HT, so it 
makes 24.
4) From my understanding: Slave is the logical computational unit and Worker is 
really the one doing the job.
5) Dunnoh
6) Dunnoh

On Jul 20, 2016, at 1:30 PM, Sachin Mittal <sjmit...@gmail.com> wrote:

Hi,
I was able to build and run my spark application via spark submit.
I have understood some of the concepts by going through the resources at 
https://spark.apache.org but few doubts still 
remain. I have few specific questions and would be glad if someone could share 
some light on it.
So I submitted the application using spark.masterlocal[*] and I have a 8 
core PC.

- What I understand is that application is called as job. Since mine had two 
stages it gets divided into 2 stages and each stage had number of tasks which 
ran in parallel.
Is this understanding correct.

- What I notice is that each stage is further divided into 262 tasks From where 
did this number 262 came from. Is this configurable. Would increasing this 
number improve performance.
- Also I see that the tasks are run in parallel in set of 8. Is this because I 
have a 8 core PC.
- What is the difference or relation between slave and worker. When I did 
spark-submit did it start 8 slaves or worker threads?
- I see all worker threads running in one single JVM. Is this because I did not 
start  slaves separately and connect it to a single master cluster manager. If 
I had done that then each worker would have run in its own JVM.
- What is the relationship between worker and executor. Can a worker have more 
than one executors? If yes then how do we configure that. Does all executor run 
in the worker JVM and are independent threads.
I suppose that is all for now. Would appreciate any response.Will add followup 
questions if any.
Thanks
Sachin





--
___
Quant | Engineer | Boy
___
blog:
http://litaotao.github.io
github: www.github.com/litaotao
This email is confidential and may be subject to privilege. If you are not the 
intended recipient, please do not copy or disclose its content but contact the 
sender immediately upon receipt.


writing Kafka dstream to local flat file

2016-07-21 Thread Puneet Tripathi
Hi, I am trying to consume from Kafka topics following
http://spark.apache.org/docs/latest/streaming-kafka-integration.html, Approach
1 (createStream). I am not able to write the stream to a local text file using the
saveAsTextFiles() function. Below is the code:

import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
zkQuorum, topic = 'localhost:9092', 'python-kafka'

kafka_stream = KafkaUtils.createStream(ssc, zkQuorum,None, {topic: 1})
lines = kafka_stream.map(lambda x: x[1])

kafka_stream.saveAsTextFiles('file:///home/puneett/')

When I access the consumer I get following output

[puneett@gb-slo-svb-0255 ~]$ 
/nfs/science/shared/kafka/kafka/bin/kafka-console-consumer.sh --topic 
python-kafka --property schema.registry.url="http://localhost:9092"; --zookeeper 
localhost:2182 --from-beginning
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/nfs/science/shared/kafka/kafka/core/build/dependant-libs-2.10.6/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/nfs/science/shared/kafka/kafka/tools/build/dependant-libs-2.10.6/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/nfs/science/shared/kafka/kafka/connect/api/build/dependant-libs/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/nfs/science/shared/kafka/kafka/connect/runtime/build/dependant-libs/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/nfs/science/shared/kafka/kafka/connect/file/build/dependant-libs/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/nfs/science/shared/kafka/kafka/connect/json/build/dependant-libs/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
There are 1 more similar kafka test message produced
There are 1 more similar kafka test message produced
There are 1 more similar kafka test message produced
There are 1 more similar kafka test message produced
There are 1 more similar kafka test message produced
There are 1 more similar kafka test message produced
There are 1 more similar kafka test message produced
There

Please can someone suggest what am I doing wrong?
Regards,
Puneet
dunnhumby limited is a limited company registered in England and Wales with 
registered number 02388853 and VAT registered number 927 5871 83. Our 
registered office is at Aurora House, 71-75 Uxbridge Road, London W5 5SL. The 
contents of this message and any attachments to it are confidential and may be 
legally privileged. If you have received this message in error you should 
delete it from your system immediately and advise the sender. dunnhumby may 
monitor and record all emails. The views expressed in this email are those of 
the sender and not those of dunnhumby.


Re: Spark Job trigger in production

2016-07-21 Thread Lars Albertsson
I assume that you would like to trigger Spark batch jobs, and not
streaming jobs.

For production jobs, I recommend avoiding scheduling batch jobs
directly with cron or cron services like Chronos. Sometimes, jobs will
fail, either due to missing input data, or due to execution problems.
When it happens, you will need a mechanism to backfill missing
datasets by retrying jobs, or your system will be brittle.

The component that does this for you is called a workflow manager. I
suggest using either Luigi (https://github.com/spotify/luigi) or
Airflow (https://github.com/apache/incubator-airflow). You will need
to periodically schedule the workflow manager to evaluate your
pipeline status and run jobs (at least with Luigi), but the workflow
manager verifies input data presence before starting jobs, and can
cover up for transient failures and delayed input data, making the
system as a whole stable.

Oozie, mentioned in this thread, is also a workflow manager. It has an
XML-based DSL, however. It is therefore syntax-wise clumsy, and
limited in expressivity, which prevents you from using it for some
complex, but common scenarios, e.g. pipelines requiring dynamic
dependencies.

Some frameworks for running services are capable of also executing
batch jobs, e.g. Aurora and Kubernetes. They have weak DSLs for
expressing dependencies, however, and are suitable only if you have
simple pipelines. They are useful if you want to run Spark Streaming
jobs, however. Marathon did not support batch jobs last I checked, and
is only useful for streaming scenarios.

You can find more context and advice on running batch jobs in
production from the resources in this list, under the sections "End to
end" and "Batch processing":
http://www.mapflat.com/lands/resources/reading-list/

Regards,


Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar: https://goo.gl/6FBtlS



On Wed, Jul 20, 2016 at 3:47 PM, Sathish Kumaran Vairavelu
 wrote:
> If you are using Mesos, then u can use Chronos or Marathon
>
> On Wed, Jul 20, 2016 at 6:08 AM Rabin Banerjee
>  wrote:
>>
>> ++ crontab :)
>>
>> On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich 
>> wrote:
>>>
>>> Another option is Oozie with the spark action:
>>> https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
>>>
>>> On Jul 18, 2016, at 12:15 AM, Jagat Singh  wrote:
>>>
>>> You can use following options
>>>
>>> * spark-submit from shell
>>> * some kind of job server. See spark-jobserver for details
>>> * some notebook environment See Zeppelin for example
>>>
>>>
>>>
>>>
>>>
>>> On 18 July 2016 at 17:13, manish jaiswal  wrote:

 Hi,


 What is the best approach to trigger spark job in production cluster?
>>>
>>>
>>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



what contribute to Task Deserialization Time

2016-07-21 Thread patcharee

Hi,

I'm running a simple job (reading a sequential file and collecting data at
the driver) in yarn-client mode. When looking at the history server
UI, the Task Deserialization Time of tasks varies quite a lot (5 ms to 5
s). What contributes to this Task Deserialization Time?


Thank you in advance!

Patcharee



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Role-based S3 access outside of EMR

2016-07-21 Thread Gourav Sengupta
Hi Teng,

This is totally flashing news to me that people cannot use EMR in
production because it's not open sourced; I think that even Werner is not
aware of such a problem. Is EMRFS open sourced? I am curious to know: what
does HA stand for?

Regards,
Gourav

On Thu, Jul 21, 2016 at 8:37 AM, Teng Qiu  wrote:

> there are several reasons that AWS users do (can) not use EMR, one
> point for us is that security compliance problem, EMR is totally not
> open sourced, we can not use it in production system. second is that
> EMR do not support HA yet.
>
> but to the original question from @Everett :
>
> -> Credentials and Hadoop Configuration
>
> as you said, best practice should be "rely on machine roles", they
> called IAM roles.
>
> we are using EMRFS impl for accessing s3, it supports IAM role-based
> access control well. you can take a look here:
> https://github.com/zalando/spark/tree/branch-1.6-zalando
>
> or simply use our docker image (Dockerfile on github:
> https://github.com/zalando/spark-appliance/tree/master/Dockerfile)
>
> docker run -d --net=host \
>-e START_MASTER="true" \
>-e START_WORKER="true" \
>-e START_WEBAPP="true" \
>-e START_NOTEBOOK="true" \
>registry.opensource.zalan.do/bi/spark:1.6.2-6
>
>
> -> SDK and File System Dependencies
>
> as mentioned above, using EMRFS libs solved this problem:
>
> http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-fs.html
>
>
> 2016-07-21 8:37 GMT+02:00 Gourav Sengupta :
> > But that would mean you would be accessing data over internet increasing
> > data read latency, data transmission failures. Why are you not using EMR?
> >
> > Regards,
> > Gourav
> >
> > On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson
> 
> > wrote:
> >>
> >> Thanks, Andy.
> >>
> >> I am indeed often doing something similar, now -- copying data locally
> >> rather than dealing with the S3 impl selection and AWS credentials
> issues.
> >> It'd be nice if it worked a little easier out of the box, though!
> >>
> >>
> >> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson
> >>  wrote:
> >>>
> >>> Hi Everett
> >>>
> >>> I always do my initial data exploration and all our product development
> >>> in my local dev env. I typically select a small data set and copy it
> to my
> >>> local machine
> >>>
> >>> My main() has an optional command line argument ‘- - runLocal’
> Normally I
> >>> load data from either hdfs:/// or S3n:// . If the arg is set I read
> from
> >>> file:///
> >>>
> >>> Sometime I use a CLI arg ‘- -dataFileURL’
> >>>
> >>> So in your case I would log into my data cluster and use “AWS s3 cp" to
> >>> copy the data into my cluster and then use “SCP” to copy the data from
> the
> >>> data center back to my local env.
> >>>
> >>> Andy
> >>>
> >>> From: Everett Anderson 
> >>> Date: Tuesday, July 19, 2016 at 2:30 PM
> >>> To: "user @spark" 
> >>> Subject: Role-based S3 access outside of EMR
> >>>
> >>> Hi,
> >>>
> >>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
> >>> FileSystem implementation for s3:// URLs and seems to install the
> necessary
> >>> S3 credentials properties, as well.
> >>>
> >>> Often, it's nice during development to run outside of a cluster even
> with
> >>> the "local" Spark master, though, which I've found to be more
> troublesome.
> >>> I'm curious if I'm doing this the right way.
> >>>
> >>> There are two issues -- AWS credentials and finding the right
> combination
> >>> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
> >>>
> >>> Credentials and Hadoop Configuration
> >>>
> >>> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY
> and
> >>> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
> >>> properties in Hadoop XML config files, but it seems better practice to
> rely
> >>> on machine roles and not expose these.
> >>>
> >>> What I end up doing is, in code, when not running on EMR, creating a
> >>> DefaultAWSCredentialsProviderChain and then installing the following
> >>> properties in the Hadoop Configuration using it:
> >>>
> >>> fs.s3.awsAccessKeyId
> >>> fs.s3n.awsAccessKeyId
> >>> fs.s3a.awsAccessKeyId
> >>> fs.s3.awsSecretAccessKey
> >>> fs.s3n.awsSecretAccessKey
> >>> fs.s3a.awsSecretAccessKey
> >>>
> >>> I also set the fs.s3.impl and fs.s3n.impl properties to
> >>> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
> >>> implementation since people usually use "s3://" URIs.
> >>>
> >>> SDK and File System Dependencies
> >>>
> >>> Some special combination of the Hadoop version, AWS SDK version, and
> >>> hadoop-aws is necessary.
> >>>
> >>> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me
> seems
> >>> to be with
> >>>
> >>> --packages
> >>> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
> >>>
> >>> Is this generally what people do? Is there a better way?
> >>>
> >>> I realize this isn't entirely a Spark-specific problem, but as so many
> >>> peo

Re: write and call UDF in spark dataframe

2016-07-21 Thread Kabeer Ahmed
Divya:

https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html

The link gives a complete example of registering a UDAF (user-defined
aggregate function), and it should also give you a good idea of how to register a
plain UDF. If you still need a hand, let us know.

Thanks
Kabeer.

Sent from Nylas N1, the extensible, open source mail client.

On Jul 21 2016, at 8:13 am, Jacek Laskowski  wrote:

On Thu, Jul 21, 2016 at 5:53 AM, Mich Talebzadeh
 wrote:
> something similar

Is this going to be in Scala?

> def ChangeToDate (word : String) : Date = {
> //return
> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(word,"dd/MM/"),"-MM-dd"))
> val d1 = Date.valueOf(ReverseDate(word))
> return d1
> }
> sqlContext.udf.register("ChangeToDate", ChangeToDate(_:String))

then...please use lowercase method names and *no* return please ;-)

BTW, no sqlContext as of Spark 2.0. Sorry.../me smiling nicely

Jacek

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: calculate time difference between consecutive rows

2016-07-21 Thread ayan guha
Please post your code and results. Lag will be null for the first record.
Also, what data type are you using? Are you using a cast?
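For reference, a minimal sketch that computes the difference in seconds (it assumes
Time1 is a string like "07:30:23" and a HiveContext, since window functions need one
before Spark 2.0):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy("Time1")   // add partitionBy(...) if you have a grouping key
val withDiff = df
  .withColumn("secs", unix_timestamp(col("Time1"), "HH:mm:ss"))
  .withColumn("diff_secs", col("secs") - lag(col("secs"), 1).over(w))
// diff_secs is null for the first row, as expected with lag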
On 21 Jul 2016 14:28, "Divya Gehlot"  wrote:

> I have a dataset of time as shown below :
> Time1
> 07:30:23
> 07:34:34
> 07:38:23
> 07:39:12
> 07:45:20
>
> I need to find the diff between two consecutive rows
> I googled and found the *lag *function in *spark *helps in finding it .
> but its  giving me *null *in the result set.
>
> Would really appreciate the help.
>
>
> Thanks,
> Divya
>
>


Re: Role-based S3 access outside of EMR

2016-07-21 Thread Teng Qiu
there are several reasons that AWS users do not (or cannot) use EMR. One
point for us is a security compliance problem: EMR is not
open sourced at all, so we cannot use it in our production system. A second is that
EMR does not support HA yet.

but to the original question from @Everett :

-> Credentials and Hadoop Configuration

as you said, best practice should be "rely on machine roles", they
called IAM roles.

we are using EMRFS impl for accessing s3, it supports IAM role-based
access control well. you can take a look here:
https://github.com/zalando/spark/tree/branch-1.6-zalando

or simply use our docker image (Dockerfile on github:
https://github.com/zalando/spark-appliance/tree/master/Dockerfile)

docker run -d --net=host \
   -e START_MASTER="true" \
   -e START_WORKER="true" \
   -e START_WEBAPP="true" \
   -e START_NOTEBOOK="true" \
   registry.opensource.zalan.do/bi/spark:1.6.2-6


-> SDK and File System Dependencies

as mentioned above, using EMRFS libs solved this problem:
http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-fs.html
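if you want to stay on stock Hadoop S3A instead of EMRFS, a rough sketch of wiring
the role credentials into the Hadoop configuration (the property names below are for
Hadoop 2.7.x s3a and may differ in other versions):

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain

val creds = new DefaultAWSCredentialsProviderChain().getCredentials
val hc = sc.hadoopConfiguration
hc.set("fs.s3.impl",  "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.set("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.set("fs.s3a.access.key", creds.getAWSAccessKeyId)
hc.set("fs.s3a.secret.key", creds.getAWSSecretKey)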


2016-07-21 8:37 GMT+02:00 Gourav Sengupta :
> But that would mean you would be accessing data over internet increasing
> data read latency, data transmission failures. Why are you not using EMR?
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson 
> wrote:
>>
>> Thanks, Andy.
>>
>> I am indeed often doing something similar, now -- copying data locally
>> rather than dealing with the S3 impl selection and AWS credentials issues.
>> It'd be nice if it worked a little easier out of the box, though!
>>
>>
>> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson
>>  wrote:
>>>
>>> Hi Everett
>>>
>>> I always do my initial data exploration and all our product development
>>> in my local dev env. I typically select a small data set and copy it to my
>>> local machine
>>>
>>> My main() has an optional command line argument ‘- - runLocal’ Normally I
>>> load data from either hdfs:/// or S3n:// . If the arg is set I read from
>>> file:///
>>>
>>> Sometime I use a CLI arg ‘- -dataFileURL’
>>>
>>> So in your case I would log into my data cluster and use “AWS s3 cp" to
>>> copy the data into my cluster and then use “SCP” to copy the data from the
>>> data center back to my local env.
>>>
>>> Andy
>>>
>>> From: Everett Anderson 
>>> Date: Tuesday, July 19, 2016 at 2:30 PM
>>> To: "user @spark" 
>>> Subject: Role-based S3 access outside of EMR
>>>
>>> Hi,
>>>
>>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
>>> FileSystem implementation for s3:// URLs and seems to install the necessary
>>> S3 credentials properties, as well.
>>>
>>> Often, it's nice during development to run outside of a cluster even with
>>> the "local" Spark master, though, which I've found to be more troublesome.
>>> I'm curious if I'm doing this the right way.
>>>
>>> There are two issues -- AWS credentials and finding the right combination
>>> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
>>>
>>> Credentials and Hadoop Configuration
>>>
>>> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
>>> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
>>> properties in Hadoop XML config files, but it seems better practice to rely
>>> on machine roles and not expose these.
>>>
>>> What I end up doing is, in code, when not running on EMR, creating a
>>> DefaultAWSCredentialsProviderChain and then installing the following
>>> properties in the Hadoop Configuration using it:
>>>
>>> fs.s3.awsAccessKeyId
>>> fs.s3n.awsAccessKeyId
>>> fs.s3a.awsAccessKeyId
>>> fs.s3.awsSecretAccessKey
>>> fs.s3n.awsSecretAccessKey
>>> fs.s3a.awsSecretAccessKey
>>>
>>> I also set the fs.s3.impl and fs.s3n.impl properties to
>>> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
>>> implementation since people usually use "s3://" URIs.
>>>
>>> SDK and File System Dependencies
>>>
>>> Some special combination of the Hadoop version, AWS SDK version, and
>>> hadoop-aws is necessary.
>>>
>>> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems
>>> to be with
>>>
>>> --packages
>>> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
>>>
>>> Is this generally what people do? Is there a better way?
>>>
>>> I realize this isn't entirely a Spark-specific problem, but as so many
>>> people seem to be using S3 with Spark, I imagine this community's faced the
>>> problem a lot.
>>>
>>> Thanks!
>>>
>>> - Everett
>>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: XLConnect in SparkR

2016-07-21 Thread Marco Mistroni
Hi,
 have you tried to use spark-csv (https://github.com/databricks/spark-csv)?
After all, you can convert an XLS file to CSV.
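A minimal sketch once you have a CSV (spark-csv 1.x syntax; the path is just an
example):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///data/myworkbook.csv")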
hth.

On Thu, Jul 21, 2016 at 4:25 AM, Felix Cheung 
wrote:

> From looking at be CLConnect package, its loadWorkbook() function only
> supports reading from local file path, so you might need a way to call HDFS
> command to get the file from HDFS first.
>
> SparkR currently does not support this - you could read it in as a text
> file (I don't think .xlsx is a text format though), collect to get all the
> data at the driver, then save to local path perhaps?
>
>
>
>
>
> On Wed, Jul 20, 2016 at 3:48 AM -0700, "Rabin Banerjee" <
> dev.rabin.baner...@gmail.com> wrote:
>
> Hi Yogesh ,
>
>   I have never tried reading XLS files using Spark . But I think you can
> use sc.wholeTextFiles  to read the complete xls at once , as xls files
> are xml internally, you need to read them all to parse . Then I think you
> can use apache poi to read them .
>
> Also, you can copy you XLS data to a MS-Access file to access via JDBC ,
>
> Regards,
> Rabin Banerjee
>
> On Wed, Jul 20, 2016 at 2:12 PM, Yogesh Vyas  wrote:
>
>> Hi,
>>
>> I am trying to load and read excel sheets from HDFS in sparkR using
>> XLConnect package.
>> Can anyone help me in finding out how to read xls files from HDFS in
>> sparkR ?
>>
>> Regards,
>> Yogesh
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Best practices to restart Spark jobs programatically from driver itself

2016-07-21 Thread Lars Albertsson
You can use a workflow manager, which gives you tools to handle transient
failures in data pipelines. I suggest either Luigi or Airflow. They provide
DSLs embedded in Python, so if the primitives provided are insufficient, it
is easy to customise Spark tasks with restart logic.

Regards,

Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109
Calendar: https://goo.gl/tV2hWF

On Jul 20, 2016 17:12, "unk1102"  wrote:

> Hi I have multiple long running spark jobs which many times hangs because
> of
> multi tenant Hadoop cluster and resource scarcity. I am thinking of
> restarting spark job within driver itself. For e.g. if spark job does not
> write output files for say 30 minutes then I want to restart spark job by
> itself I mean submitting itself from scratch to yarn cluster. Is it
> possible? Is there any best practices for Spark job restart? Please guide.
> Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Best-practices-to-restart-Spark-jobs-programatically-from-driver-itself-tp27371.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


RE: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-07-21 Thread Ravi Aggarwal
Hi Ian,

Thanks for the information.

I think you are referring to post 
http://apache-spark-user-list.1001560.n3.nabble.com/How-spark-decides-whether-to-do-BroadcastHashJoin-or-SortMergeJoin-td27369.html.

Yeah, I could solve the above issue of mine using
spark.sql.autoBroadcastJoinThreshold=-1, so that it always results in a
sort-merge join instead of a BroadcastHashJoin. The more ideal fix for me, though,
is to calculate the size of my custom default source (BaseRelation's sizeInBytes)
in the right manner, to make the Spark planner take the appropriate decision for me.
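For anyone hitting the same thing, the hook I mean is roughly this (a sketch only;
the row count and row size are placeholders, in practice you would derive them from
the underlying store):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

class HBaseRelation(val schema: StructType, parameters: Map[String, String])
                   (@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // Placeholder estimates; anything above spark.sql.autoBroadcastJoinThreshold
  // (10MB by default) keeps the planner from broadcasting this relation.
  private val estimatedRowCount = 100L * 1000 * 1000
  private val estimatedRowSize  = 512L

  override def sizeInBytes: Long = estimatedRowCount * estimatedRowSize

  override def buildScan(): RDD[Row] = ???  // as in the full source further down
}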

Thanks
Ravi

From: ianoconn...@gmail.com [mailto:ianoconn...@gmail.com] On Behalf Of Ian 
O'Connell
Sent: Wednesday, July 20, 2016 11:05 PM
To: Ravi Aggarwal 
Cc: Ted Yu ; user 
Subject: Re: OutOfMemory when doing joins in spark 2.0 while same code runs 
fine in spark 1.5.2

Ravi did your issue ever get solved for this?

I think i've been hitting the same thing, it looks like the 
spark.sql.autoBroadcastJoinThreshold stuff isn't kicking in as expected, if I 
set that to -1 then the computation proceeds successfully.

On Tue, Jun 14, 2016 at 12:28 AM, Ravi Aggarwal <raagg...@adobe.com> wrote:
Hi,

Is there any breakthrough here?

I had one more observation while debugging the issue
Here are the 4 types of data I had:

Da -> stored in parquet
Di -> stored in parquet
Dl1 -> parquet version of lookup
Dl2 -> hbase version of lookup

Joins performed and type of join done by spark:
Da and Di     Sort-Merge    failed (OOM)
Da and Dl1    B-H           passed
Da and Dl2    Sort-Merge    passed
Di and Dl1    B-H           passed
Di and Dl2    Sort-Merge    failed (OOM)

From these entries I can deduce that the problem is with the sort-merge join involving Di.
So the HBase thing is out of the equation; that is not the culprit.
In the physical plan I could see there are only two operations that are done
additionally in sort-merge as compared to broadcast-hash:

==> Exchange Hashpartitioning

==> Sort
And finally sort-merge join.

Can we deduce anything from this?

Thanks
Ravi
From: Ravi Aggarwal
Sent: Friday, June 10, 2016 12:31 PM
To: 'Ted Yu' <yuzhih...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: RE: OutOfMemory when doing joins in spark 2.0 while same code runs 
fine in spark 1.5.2

Hi Ted,
Thanks for the reply.

Here is the code
Btw – df.count is running fine on dataframe generated from this default source. 
I think it is something in the combination of join and hbase data source that 
is creating issue. But not sure entirely.
I have also dumped the physical plans of both approaches s3a/s3a join and 
s3a/hbase join, In case you want that let me know.

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat}
import org.apache.hadoop.hbase._
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.execution.datasources.{OutputWriterFactory, 
FileFormat}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.slf4j.LoggerFactory

class DefaultSource extends SchemaRelationProvider with FileFormat {

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, 
String], schema: StructType) = {
new HBaseRelation(schema, parameters)(sqlContext)
  }

  def inferSchema(sparkSession: SparkSession,
  options: Map[String, String],
  files: Seq[FileStatus]): Option[StructType] = ???

  def prepareWrite(sparkSession: SparkSession,
   job: Job,
   options: Map[String, String],
   dataSchema: StructType): OutputWriterFactory = ???
}

object HBaseConfigurationUtil {
  lazy val logger = LoggerFactory.getLogger("HBaseConfigurationUtil")
  val hbaseConfiguration = (tableName: String, hbaseQuorum: String) => {
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, tableName)
conf.set("hbase.mapred.outputtable", tableName)
conf.set("hbase.zookeeper.quorum", hbaseQuorum)
conf
  }
}

class HBaseRelation(val schema: StructType, parameters: Map[String, String])
   (@transient val sqlContext: SQLContext) extends BaseRelation 
with TableScan {

  import sqlContext.sparkContext

  override def buildScan(): RDD[Row] = {

val bcDataSchema = sparkContext.broadcast(schema)

val tableName = parameters.get("path") match {
  case Some(t) => t
  case _ => throw new RuntimeException("Table name (path) not provided in 
parameters")
}

val hbaseQuorum = parameters.get("hbaseQuorum") match {
  case Some(s: String) => s
  case _ => throw new RuntimeExcep

Re: Where is the SparkSQL Specification?

2016-07-21 Thread Mich Talebzadeh
Spark SQL is a subset of Hive SQL, which by and large supports ANSI-92 SQL,
including LIKE search patterns such as the one you mention:

scala> sqlContext.sql("select count(1) from oraclehadoop.channels where
channel_desc like ' %b_xx%'").show
+---+
|_c0|
+---+
|  0|
+---+

So check the Hive QL language support documentation.
HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 July 2016 at 08:20, Linyuxin  wrote:

> Hi All
>
> Newbee here.
>
> My spark version is 1.5.1
>
>
>
> And I want to know how can I find the Specification of Spark SQL to find
> out that if it is supported ‘a like %b_xx’ or other sql syntax
>


Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-21 Thread Taotao.Li
Hi, Sachin, there is no plan to translate these into English currently,
sorry for that, but you can check Databricks' blog; there are lots of
high-quality and easy-to-understand posts.

or you can check the list in this post of mine, choose the English version:


   - spark-resouces-blogs-paper
   


On Thu, Jul 21, 2016 at 12:19 PM, Sachin Mittal  wrote:

> Hi,
> Thanks for the links, is there any english translation for the same?
>
> Sachin
>
>
> On Thu, Jul 21, 2016 at 8:34 AM, Taotao.Li 
> wrote:
>
>> Hi, Sachin,  here are two posts about the basic concepts about spark:
>>
>>
>>- spark-questions-concepts
>>
>>- deep-into-spark-exection-model
>>
>>
>>
>> And, I fully recommend databrick's post:
>> https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html
>>
>>
>> On Thu, Jul 21, 2016 at 1:36 AM, Jean Georges Perrin  wrote:
>>
>>> Hey,
>>>
>>> I love when questions are numbered, it's easier :)
>>>
>>> 1) Yes (but I am not an expert)
>>> 2) You don't control... One of my process is going to 8k tasks, so...
>>> 3) Yes, if you have HT, it double. My servers have 12 cores, but HT, so
>>> it makes 24.
>>> 4) From my understanding: Slave is the logical computational unit and
>>> Worker is really the one doing the job.
>>> 5) Dunnoh
>>> 6) Dunnoh
>>>
>>> On Jul 20, 2016, at 1:30 PM, Sachin Mittal  wrote:
>>>
>>> Hi,
>>> I was able to build and run my spark application via spark submit.
>>>
>>> I have understood some of the concepts by going through the resources at
>>> https://spark.apache.org but few doubts still remain. I have few
>>> specific questions and would be glad if someone could share some light on
>>> it.
>>>
>>> So I submitted the application using spark.masterlocal[*] and I have
>>> a 8 core PC.
>>>
>>> - What I understand is that application is called as job. Since mine had
>>> two stages it gets divided into 2 stages and each stage had number of tasks
>>> which ran in parallel.
>>> Is this understanding correct.
>>>
>>> - What I notice is that each stage is further divided into 262 tasks
>>> From where did this number 262 came from. Is this configurable. Would
>>> increasing this number improve performance.
>>>
>>> - Also I see that the tasks are run in parallel in set of 8. Is this
>>> because I have a 8 core PC.
>>>
>>> - What is the difference or relation between slave and worker. When I
>>> did spark-submit did it start 8 slaves or worker threads?
>>>
>>> - I see all worker threads running in one single JVM. Is this because I
>>> did not start  slaves separately and connect it to a single master cluster
>>> manager. If I had done that then each worker would have run in its own JVM.
>>>
>>> - What is the relationship between worker and executor. Can a worker
>>> have more than one executors? If yes then how do we configure that. Does
>>> all executor run in the worker JVM and are independent threads.
>>>
>>> I suppose that is all for now. Would appreciate any response.Will add
>>> followup questions if any.
>>>
>>> Thanks
>>> Sachin
>>>
>>>
>>>
>>>
>>
>>
>> --
>> *___*
>> Quant | Engineer | Boy
>> *___*
>> *blog*:http://litaotao.github.io
>> 
>> *github*: www.github.com/litaotao
>>
>
>


-- 
*___*
Quant | Engineer | Boy
*___*
*blog*:http://litaotao.github.io

*github*: www.github.com/litaotao


Re: Optimize filter operations with sorted data

2016-07-21 Thread Chanh Le
You can check in the Spark UI or in the output of your Spark application:
see how many stages and tasks there are before and after you partition,
and also compare the run times.

Regards,
Chanh
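
For example, a minimal sketch of checking this from the application output
(names, paths and values are made up), assuming the data was written
partitioned by time and network_id as described in the quoted messages below:

val df = sqlContext.read.parquet("/data/events")
val filtered = df.filter(df("time") === "2016-07-07" && df("network_id") === 42)
filtered.explain(true)   // the physical plan shows which partition filters are pushed down
// Then compare stages, tasks and run time in the Spark UI for a query that
// filters on the partition columns versus one that does not.
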

On Thu, Jul 7, 2016 at 6:40 PM, tan shai  wrote:

> How can you verify that it is loading only the partitions that match the time and
> network_id values in the filter?
>
> 2016-07-07 11:58 GMT+02:00 Chanh Le :
>
>> Hi Tan,
>> It depends on how the data is organised and what your filter is.
>> For example, in my case I store data partitioned by the fields time and
>> network_id. If I filter by time or network_id or both, together with other fields,
>> Spark only loads the partitions that match the time and network_id values in the
>> filter and then filters the rest.
>>
>>
>>
>> > On Jul 7, 2016, at 4:43 PM, Ted Yu  wrote:
>> >
>> > Does the filter under consideration operate on sorted column(s) ?
>> >
>> > Cheers
>> >
>> >> On Jul 7, 2016, at 2:25 AM, tan shai  wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have a sorted dataframe, and I need to optimize the filter operations.
>> >> How does Spark perform filter operations on a sorted dataframe?
>> >>
>> >> Is it scanning all the data?
>> >>
>> >> Many thanks.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>>
>


Where is the SparkSQL Specification?

2016-07-21 Thread Linyuxin
Hi All
Newbie here.
My spark version is 1.5.1

And I want to know how I can find the specification of Spark SQL, to find out
whether syntax such as ‘a like %b_xx’ is supported
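
For what it is worth, Spark SQL accepts standard SQL LIKE patterns with % and
_ wildcards, and the quickest way to check a piece of syntax is to try it in
spark-shell; a minimal sketch with made-up data:

val df = sqlContext.createDataFrame(Seq(("ab_xx", 1), ("other", 2))).toDF("a", "n")
df.registerTempTable("t")
sqlContext.sql("SELECT * FROM t WHERE a LIKE '%b_xx'").show()
// DataFrame API equivalent
df.filter(df("a").like("%b_xx")).show()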


Re: calculate time difference between consecutive rows

2016-07-21 Thread Jacek Laskowski
Hi,

What was the code you tried? You should use the built-in window
aggregate (window) functions or create one yourself. I haven't tried
lag before (and don't think it's what you need, really).

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
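
For reference, lag over an ordered window is one common way to compute such
differences; a minimal sketch (assuming a DataFrame df with a string column
Time1 in HH:mm:ss format; on Spark 1.x window functions need a HiveContext),
where a null in the first row is expected because it has no previous row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy("Time1")   // a single global ordering; fine for small data
val withDiff = df
  .withColumn("prev", lag(col("Time1"), 1).over(w))
  .withColumn("diff_seconds",
    unix_timestamp(col("Time1"), "HH:mm:ss") - unix_timestamp(col("prev"), "HH:mm:ss"))
withDiff.show()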


On Thu, Jul 21, 2016 at 6:27 AM, Divya Gehlot  wrote:
> I have a dataset of time as shown below :
> Time1
> 07:30:23
> 07:34:34
> 07:38:23
> 07:39:12
> 07:45:20
>
> I need to find the diff between two consecutive rows.
> I googled and found that the lag function in Spark helps in finding it,
> but it's giving me null in the result set.
>
> Would really appreciate the help.
>
>
> Thanks,
> Divya
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: write and call UDF in spark dataframe

2016-07-21 Thread Jacek Laskowski
On Thu, Jul 21, 2016 at 5:53 AM, Mich Talebzadeh
 wrote:
> something similar

Is this going to be in Scala?

> def ChangeToDate (word : String) : Date = {
>   //return
> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(word,"dd/MM/yyyy"),"yyyy-MM-dd"))
>   val d1 = Date.valueOf(ReverseDate(word))
>   return d1
> }
> sqlContext.udf.register("ChangeToDate", ChangeToDate(_:String))

then...please use lowercase method names and *no* return please ;-)

BTW, no sqlContext as of Spark 2.0. Sorry.../me smiling nicely

Jacek
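
Putting those two suggestions together, a minimal sketch of the rewritten
snippet (assuming a SparkSession named spark and a reverseDate helper
equivalent to the ReverseDate used above; names are illustrative):

import java.sql.Date

// reverseDate is assumed to be the helper shown as ReverseDate above
def changeToDate(word: String): Date = Date.valueOf(reverseDate(word))

// Spark 2.0 style: register the UDF on the SparkSession, no sqlContext needed
spark.udf.register("changeToDate", changeToDate _)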

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: write and call UDF in spark dataframe

2016-07-21 Thread Jacek Laskowski
On Thu, Jul 21, 2016 at 4:53 AM, Divya Gehlot  wrote:

> To be very specific, I am looking for the UDF syntax, for example one which takes a
> String as a parameter and returns an integer... how do we define the return type?

val f: String => Int = ???
val myUDF = udf(f)

or

val myUDF = udf[Int, String] { ??? }   // return type first, then the argument type

or

val myUDF = udf { (s: String) => ??? }

Jacek
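
To make that concrete, a minimal sketch of a String => Int UDF applied to a
DataFrame column (the DataFrame df and the column names are made up):

import org.apache.spark.sql.functions.udf

// a UDF that takes a String and returns an Int
val strLen = udf { (s: String) => if (s == null) 0 else s.length }

// applied to a hypothetical DataFrame df with a string column "name"
val withLen = df.withColumn("name_len", strLen(df("name")))
withLen.show()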

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: write and call UDF in spark dataframe

2016-07-21 Thread Jacek Laskowski
On Wed, Jul 20, 2016 at 1:22 PM, Rishabh Bhardwaj  wrote:

> val new_df = df.select(from_unixtime($"time").as("newtime"))

or better yet using tick (less typing and more prose than code :))

df.select(from_unixtime('time) as "newtime")

Jacek
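
For completeness, a minimal self-contained sketch of both forms (assuming a
SparkSession named spark, e.g. in spark-shell; the data is made up):

import org.apache.spark.sql.functions.from_unixtime
import spark.implicits._   // enables both the $"time" and 'time column syntax

// made-up data: Unix epoch seconds in a column named time
val df = Seq(1469080800L, 1469084400L).toDF("time")

df.select(from_unixtime($"time").as("newtime")).show()
df.select(from_unixtime('time) as "newtime").show()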

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Unsubscribe

2016-07-21 Thread Kath Gallagher




Re: Re: run spark apps in linux crontab

2016-07-21 Thread Mich Talebzadeh
One more thing.

If you run a file interactively and you are interested in capturing the
output in a file plus seeing the output on the screen, you can use *tee -a*:

ENVFILE=$HOME/dba/bin/environment.ksh
if [[ -f $ENVFILE ]]
then
. $ENVFILE
else
echo "Abort: $0 failed. No environment file ( $ENVFILE ) found"
exit 1
fi
#
FILE_NAME=`basename $0 .ksh`
LOG_FILE=${LOGDIR}/${FILE_NAME}.log
[ -f ${LOG_FILE} ] && rm -f ${LOG_FILE}

echo `date` " ""=== Started to create oraclehadoop.sales ===" | tee -a ${LOG_FILE}

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 July 2016 at 07:53,  wrote:

> got it, I've added it into my notes.
> thank you
>
>
> 
>
> Thanks&Best regards!
> San.Luo
>
> - Original Message -
> From: Mich Talebzadeh 
> To: Chanh Le 
> Cc: 罗辉 , focus , user <
> user@spark.apache.org>
> Subject: Re: run spark apps in linux crontab
> Date: July 21, 2016, 12:01 PM
>
> you should source the environment file beforehand or inside the script. For example,
> this one is ksh type:
>
> 0,5,10,15,20,25,30,35,40,45,50,55 * * * *
> (/home/hduser/dba/bin/send_messages_to_Kafka.ksh >
> /var/tmp/send_messages_to_Kafka.err 2>&1)
>
> in that shell it sources the environment file
>
> #
> # Main Section
> #
> ENVFILE=/home/hduser/.kshrc
> if [[ -f $ENVFILE ]]
> then
> . $ENVFILE
> else
> echo "Abort: $0 failed. No environment file ( $ENVFILE ) found"
> exit 1
> fi
>
> Or simply in your case say
>
> 0,5,10,15,20,25,30,35,40,45,50,55 * * * *
> (. /home/hduser/.bashrc; $SPARK_HOME/bin/spark-submit...
> > /var/tmp/somefile 2>&1)
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 July 2016 at 04:38, Chanh Le  wrote:
>
you should use command.sh | tee file.log
>
> On Jul 21, 2016, at 10:36 AM,  
> wrote:
>
>
> thank you focus, and all.
> this problem was solved by adding the line ". /etc/profile" in my shell.
>
>
> 
>
> Thanks&Best regards!
> San.Luo
>
> - Original Message -
> From: "focus" 
> To: "luohui20001" , "user@spark.apache.org" <
> user@spark.apache.org>
> Subject: Re: run spark apps in linux crontab
> Date: July 20, 2016, 6:11 PM
>
> Hi, I just met this problem, too! The reason is that the crontab runtime doesn't
> have the variables you defined, such as $SPARK_HOME.
> I defined the $SPARK_HOME and other variables in /etc/profile like this:
>
> export MYSCRIPTS=/opt/myscripts
> export SPARK_HOME=/opt/spark
>
> then, in my crontab job script daily_job.sh
>
> #!/bin/sh
>
> . /etc/profile
>
> $SPARK_HOME/bin/spark-submit $MYSCRIPTS/fix_fh_yesterday.py
>
> then, in crontab -e
>
> 0 8 * * * /home/user/daily_job.sh
>
> hope this helps~
>
>
>
>
> -- Original --
> *From:* "luohui20001";
> *Date:* Wednesday, July 20, 2016, 6:00 PM
> *To:* "user@spark.apache.org";
> *Subject:* run spark apps in linux crontab
>
> hi guys:
>   I added a spark-submit job to my Linux crontab by the means
> below, however none of them works. If I change it to a normal shell script,
> it is ok. I don't quite understand why. I checked the 8080 web UI of my
> spark cluster, no job was submitted, and there are no messages in
> /home/hadoop/log.
>   Any idea is welcome.
>
> [hadoop@master ~]$ crontab -e
> 1.
> 22 21 * * * sh /home/hadoop/shellscripts/run4.sh > /home/hadoop/log
>
> and in run4.sh, it reads:
> $SPARK_HOME/bin/spark-submit --class com.abc.myclass
> --total-executor-cores 10 --jars $SPARK_HOME/lib/MyDep.jar
> $SPARK_HOME/MyJar.jar  > /home/hadoop/log
>
> 2.
> 22 21 * * * $SPARK_HOME/bin/spark-submit --class com.abc.myclass
> --total-executor-cores 10 --jars $SPARK_HOME/lib/MyDep.jar
> $SPARK_HOME/MyJar.jar  > /home/hadoop/log
>
> 3.
> 22 21 * * * /usr/lib/spark/bin/spark-submit --class com.abc.myclass
> --total-executor-cores 10 --jars /usr/lib/spark/lib/MyDep.jar
> /usr/lib/spark/MyJar.jar  > /home/hadoop/log
>
> 4.
> 22 21 * * * hadoop /usr/lib/spark/bin

Re: Understanding Spark UI DAGs

2016-07-21 Thread Jacek Laskowski
On Thu, Jul 21, 2016 at 2:56 AM, C. Josephson  wrote:

> I just started looking at the DAG for a Spark Streaming job, and had a
> couple of questions about it (image inline).
>
> 1.) What do the numbers in brackets mean, e.g. PythonRDD[805]?
>

Every RDD has its identifier (as id attribute) within a SparkContext (which
is the broadest scope an RDD can belong to). In this case, it means you've
already created 806 RDDs (counting from 0).


> 2.) What code is "RDD at PythonRDD.scala:43" referring to? Is there any
> way to tie this back to lines of code we've written in pyspark?
>

It's called a CallSite, and it shows where in your code the RDD was created.
You can see the code yourself given the Python file and the line number.

Jacek
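
As a quick illustration of both points (a minimal Scala spark-shell sketch
rather than pyspark):

val a = sc.parallelize(1 to 10)   // "parallelize at <console>:N" becomes the call site
val b = a.map(_ * 2)
println(a.id)                     // the number shown in brackets in the DAG, e.g. 0
println(b.id)                     // increments with every RDD created, e.g. 1
println(b.toDebugString)          // lineage with ids and call sites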