Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
Thanks Marcin,

That seems to be the case. It explains why there is no documentation on this 
part too!

To be specific, where exactly should spark.authenticate be set to true?

Many thanks,

Gerry

> On 8 Dec 2016, at 08:46, Marcin Pastecki  wrote:
> 
> My understanding is that the token generation is handled by Spark itself as 
> long as you were authenticated in Kerberos when submitting the job and 
> spark.authenticate is set to true.
> 
> --keytab and --principal options should be used for "long" running job, when 
> you may need to do ticket renewal. Spark will handle it then. I may be wrong 
> though.
> 
> I guess it gets even more complicated if you need to access other secured 
> service from Spark like hbase or Phoenix, but i guess this is for another 
> discussion.
> 
> Regards,
> Marcin
> 
> 
> On Thu, Dec 8, 2016, 08:40 Gerard Casey  > wrote:
> I just read an interesting comment on cloudera:
> 
> What does it mean by “when the job is submitted,and you have a kinit, you 
> will have TOKEN to access HDFS, you would need to pass that on, or the 
> KERBEROS ticket” ?
> 
> Reference 
> 
>  and full quote:
> 
> In a cluster which is kerberised there is no SIMPLE authentication. Make sure 
> that you have run kinit before you run the application.
> Second thing to check: In your application you need to do the right thing and 
> either pass on the TOKEN or a KERBEROS ticket.
> When the job is submitted, and you have done a kinit, you will have TOKEN to 
> access HDFS you would need to pass that on, or the KERBEROS ticket.
> You will need to handle this in your code. I can not see exactly what you are 
> doing at that point in the startup of your code but any HDFS access will 
> require a TOKEN or KERBEROS ticket.
>  
> Cheers,
> Wilfred
> 
>> On 8 Dec 2016, at 08:35, Gerard Casey > > wrote:
>> 
>> Thanks Marcelo.
>> 
>> I’ve completely removed it. Ok - even if I read/write from HDFS?
>> 
>> Trying to the SparkPi example now
>> 
>> G
>> 
>>> On 7 Dec 2016, at 22:10, Marcelo Vanzin >> > wrote:
>>> 
>>> Have you removed all the code dealing with Kerberos that you posted?
>>> You should not be setting those principal / keytab configs.
>>> 
>>> Literally all you have to do is login with kinit then run spark-submit.
>>> 
>>> Try with the SparkPi example for instance, instead of your own code.
>>> If that doesn't work, you have a configuration issue somewhere.
>>> 
>>> On Wed, Dec 7, 2016 at 1:09 PM, Gerard Casey >> > wrote:
 Thanks.
 
 I’ve checked the TGT, principal and key tab. Where to next?!
 
> On 7 Dec 2016, at 22:03, Marcelo Vanzin  > wrote:
> 
> On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey  > wrote:
>> Can anyone point me to a tutorial or a run through of how to use Spark 
>> with
>> Kerberos? This is proving to be quite confusing. Most search results on 
>> the
>> topic point to what needs inputted at the point of `sparks submit` and 
>> not
>> the changes needed in the actual src/main/.scala file
> 
> You don't need to write any special code to run Spark with Kerberos.
> Just write your application normally, and make sure you're logged in
> to the KDC (i.e. "klist" shows a valid TGT) before running your app.
> 
> 
> --
> Marcelo
 
>>> 
>>> 
>>> 
>>> -- 
>>> Marcelo
>> 
> 
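[Editorial note on the question above: spark.authenticate is an ordinary Spark configuration
property, so it can be set cluster-wide in spark-defaults.conf or per job on the command line;
per the Spark security documentation, on YARN Spark then generates and distributes the shared
secret itself. A minimal sketch, with only the property name taken as given:

  # spark-defaults.conf
  spark.authenticate  true

  # or per job
  spark-submit --conf spark.authenticate=true ...
]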



How to clean the cache when i do performance test in spark

2016-12-07 Thread Zhang, Liyun
Hi all:
   When I test my Spark application, I find that the second run
(application_1481153226569_0002) is much faster than the first
(application_1481153226569_0001), even though the configuration is identical. I
suspect the second run benefits from caching. How can I clear the cache between runs?





Best Regards
Kelly Zhang/Zhang,Liyun
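
[Editorial note: since application_1481153226569_0001 and _0002 are separate applications,
Spark's own in-memory (RDD/DataFrame) cache cannot carry over between them; the second run is
usually faster because of the OS page cache on the worker nodes and warmed-up HDFS/JVM state.
A hedged sketch for dropping the Linux page cache between benchmark runs (Linux-only, run as
root on each node):

  sync && echo 3 > /proc/sys/vm/drop_caches
]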



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Marcin Pastecki
My understanding is that the token generation is handled by Spark itself as
long as you were authenticated in Kerberos when submitting the job and
spark.authenticate is set to true.

--keytab and --principal options should be used for "long" running job,
when you may need to do ticket renewal. Spark will handle it then. I may be
wrong though.

I guess it gets even more complicated if you need to access other secured
service from Spark like hbase or Phoenix, but i guess this is for another
discussion.

Regards,
Marcin

On Thu, Dec 8, 2016, 08:40 Gerard Casey  wrote:

> I just read an interesting comment on cloudera:
>
> What does it mean by “when the job is submitted,and you have a kinit, you
> will have TOKEN to access HDFS, you would need to pass that on, or the
> KERBEROS ticket” ?
>
> Reference
> 
>  and
> full quote:
>
> In a cluster which is kerberised there is no SIMPLE authentication. Make
> sure that you have run kinit before you run the application.
> Second thing to check: In your application you need to do the right thing
> and either pass on the TOKEN or a KERBEROS ticket.
> When the job is submitted, and you have done a kinit, you will have TOKEN
> to access HDFS you would need to pass that on, or the KERBEROS ticket.
> You will need to handle this in your code. I can not see exactly what you
> are doing at that point in the startup of your code but any HDFS access
> will require a TOKEN or KERBEROS ticket.
>
>
> Cheers,
> Wilfred
>
> On 8 Dec 2016, at 08:35, Gerard Casey  wrote:
>
> Thanks Marcelo.
>
> I’ve completely removed it. Ok - even if I read/write from HDFS?
>
> Trying to the SparkPi example now
>
> G
>
> On 7 Dec 2016, at 22:10, Marcelo Vanzin  wrote:
>
> Have you removed all the code dealing with Kerberos that you posted?
> You should not be setting those principal / keytab configs.
>
> Literally all you have to do is login with kinit then run spark-submit.
>
> Try with the SparkPi example for instance, instead of your own code.
> If that doesn't work, you have a configuration issue somewhere.
>
> On Wed, Dec 7, 2016 at 1:09 PM, Gerard Casey 
> wrote:
>
> Thanks.
>
> I’ve checked the TGT, principal and key tab. Where to next?!
>
> On 7 Dec 2016, at 22:03, Marcelo Vanzin  wrote:
>
> On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey 
> wrote:
>
> Can anyone point me to a tutorial or a run through of how to use Spark with
> Kerberos? This is proving to be quite confusing. Most search results on the
> topic point to what needs inputted at the point of `sparks submit` and not
> the changes needed in the actual src/main/.scala file
>
>
> You don't need to write any special code to run Spark with Kerberos.
> Just write your application normally, and make sure you're logged in
> to the KDC (i.e. "klist" shows a valid TGT) before running your app.
>
>
> --
> Marcelo
>
>
>
>
>
> --
> Marcelo
>
>
>
>


Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
I just read an interesting comment on cloudera:

What does it mean by “when the job is submitted, and you have a kinit, you will 
have TOKEN to access HDFS, you would need to pass that on, or the KERBEROS 
ticket” ?

Reference 

 and full quote:

In a cluster which is kerberised there is no SIMPLE authentication. Make sure 
that you have run kinit before you run the application.
Second thing to check: In your application you need to do the right thing and 
either pass on the TOKEN or a KERBEROS ticket.
When the job is submitted, and you have done a kinit, you will have TOKEN to 
access HDFS you would need to pass that on, or the KERBEROS ticket.
You will need to handle this in your code. I can not see exactly what you are 
doing at that point in the startup of your code but any HDFS access will 
require a TOKEN or KERBEROS ticket.
 
Cheers,
Wilfred

> On 8 Dec 2016, at 08:35, Gerard Casey  wrote:
> 
> Thanks Marcelo.
> 
> I’ve completely removed it. Ok - even if I read/write from HDFS?
> 
> Trying to the SparkPi example now
> 
> G
> 
>> On 7 Dec 2016, at 22:10, Marcelo Vanzin > > wrote:
>> 
>> Have you removed all the code dealing with Kerberos that you posted?
>> You should not be setting those principal / keytab configs.
>> 
>> Literally all you have to do is login with kinit then run spark-submit.
>> 
>> Try with the SparkPi example for instance, instead of your own code.
>> If that doesn't work, you have a configuration issue somewhere.
>> 
>> On Wed, Dec 7, 2016 at 1:09 PM, Gerard Casey > > wrote:
>>> Thanks.
>>> 
>>> I’ve checked the TGT, principal and key tab. Where to next?!
>>> 
 On 7 Dec 2016, at 22:03, Marcelo Vanzin > wrote:
 
 On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey > wrote:
> Can anyone point me to a tutorial or a run through of how to use Spark 
> with
> Kerberos? This is proving to be quite confusing. Most search results on 
> the
> topic point to what needs inputted at the point of `sparks submit` and not
> the changes needed in the actual src/main/.scala file
 
 You don't need to write any special code to run Spark with Kerberos.
 Just write your application normally, and make sure you're logged in
 to the KDC (i.e. "klist" shows a valid TGT) before running your app.
 
 
 --
 Marcelo
>>> 
>> 
>> 
>> 
>> -- 
>> Marcelo
> 



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
Thanks Marcelo.

I’ve completely removed it. Ok - even if I read/write from HDFS?

Trying the SparkPi example now

G

> On 7 Dec 2016, at 22:10, Marcelo Vanzin  wrote:
> 
> Have you removed all the code dealing with Kerberos that you posted?
> You should not be setting those principal / keytab configs.
> 
> Literally all you have to do is login with kinit then run spark-submit.
> 
> Try with the SparkPi example for instance, instead of your own code.
> If that doesn't work, you have a configuration issue somewhere.
> 
> On Wed, Dec 7, 2016 at 1:09 PM, Gerard Casey  > wrote:
>> Thanks.
>> 
>> I’ve checked the TGT, principal and key tab. Where to next?!
>> 
>>> On 7 Dec 2016, at 22:03, Marcelo Vanzin  wrote:
>>> 
>>> On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey  
>>> wrote:
 Can anyone point me to a tutorial or a run through of how to use Spark with
 Kerberos? This is proving to be quite confusing. Most search results on the
 topic point to what needs inputted at the point of `sparks submit` and not
 the changes needed in the actual src/main/.scala file
>>> 
>>> You don't need to write any special code to run Spark with Kerberos.
>>> Just write your application normally, and make sure you're logged in
>>> to the KDC (i.e. "klist" shows a valid TGT) before running your app.
>>> 
>>> 
>>> --
>>> Marcelo
>> 
> 
> 
> 
> -- 
> Marcelo



Re: Monitoring the User Metrics for a long running Spark Job

2016-12-07 Thread Sonal Goyal
You can try updating metrics.properties for the sink of your choice. In our
case, we add the following for getting application metrics in JSON format
using http

*.sink.reifier.class= org.apache.spark.metrics.sink.MetricsServlet

Here, we have defined the sink with name reifier and its class is the
MetricsServlet class. Then you can poll /metrics/applications/json

Take a look at https://github.com/hammerlab/spark-json-relay if it serves
your need.

Thanks,
Sonal
Nube Technologies 





On Wed, Dec 7, 2016 at 1:10 AM, Chawla,Sumit  wrote:

> Any pointers on this?
>
> Regards
> Sumit Chawla
>
>
> On Mon, Dec 5, 2016 at 8:30 PM, Chawla,Sumit 
> wrote:
>
>> An example implementation I found is: https://github.com/groupon/spark-metrics
>>
>> Anyone has any experience using this?  I am more interested in something
>> for Pyspark specifically.
>>
>> The above link pointed to - https://github.com/apache/spark/blob/master/conf/metrics.properties.template.  I need to spend some
>> time reading it, but any quick pointers will be appreciated.
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>> On Mon, Dec 5, 2016 at 8:17 PM, Chawla,Sumit 
>> wrote:
>>
>>> Hi Manish
>>>
>>> I am specifically looking for something similar to the following:
>>>
>>>  https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/common/index.html#accumulators--counters.
>>>
>>> Flink has this concept of Accumulators, where user can keep its custom
>>> counters etc.  While the application is executing these counters are
>>> queryable through REST API provided by Flink Monitoring Backend.  This way
>>> you don't have to wait for the program to complete.
>>>
>>>
>>>
>>> Regards
>>> Sumit Chawla
>>>
>>>
>>> On Mon, Dec 5, 2016 at 5:53 PM, manish ranjan 
>>> wrote:
>>>
 http://spark.apache.org/docs/latest/monitoring.html

 You can even install tools like dstat, iostat, and iotop; *collectd* can provide
 fine-grained profiling on individual nodes.

 If you are using Mesos as Resource Manager , mesos exposes metrics as
 well for the running job.

 Manish

 ~Manish



 On Mon, Dec 5, 2016 at 4:17 PM, Chawla,Sumit 
 wrote:

> Hi All
>
> I have a long running job which takes hours and hours to process
> data.  How can i monitor the operational efficency of this job?  I am
> interested in something like Storm\Flink style User metrics/aggregators,
> which i can monitor while my job is running.  Using these metrics i want 
> to
> monitor, per partition performance in processing items.  As of now, only
> way for me to get these metrics is when the job finishes.
>
> One possibility is that spark can flush the metrics to external system
> every few seconds, and thus use  an external system to monitor these
> metrics.  However, i wanted to see if the spark supports any such use case
> OOB.
>
>
> Regards
> Sumit Chawla
>
>

>>>
>>
>
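
[A minimal sketch in Spark itself of the counter pattern discussed above, using a named
accumulator (Spark 2.x API; the application name, path and predicate are illustrative).
Named accumulators are reported per stage in the web UI and in the /api/v1 status REST API
while the job is running:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
  val sc = spark.sparkContext

  // Named accumulator: its running value is visible in the UI / REST API per stage.
  val badRecords = sc.longAccumulator("badRecords")

  sc.textFile("hdfs:///data/input").foreach { line =>
    if (line.trim.isEmpty) badRecords.add(1L)
  }
  println(s"bad records: ${badRecords.value}")
]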


Re: WARN util.NativeCodeLoader

2016-12-07 Thread Sean Owen
You can ignore it. You can also install the native libs in question but
it's just a minor accelerator.

On Thu, Dec 8, 2016 at 2:36 PM baipeng  wrote:

> Hi ALL
>
> I’m new to Spark.When I execute spark-shell, the first line is as follows
>  WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
> platform... using builtin-java classes where applicable.
> Can someone tell me how to solve the problem?
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


WARN util.NativeCodeLoader

2016-12-07 Thread baipeng
Hi ALL

I’m new to Spark. When I execute spark-shell, the first line is as follows
 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your 
platform... using builtin-java classes where applicable.
Can someone tell me how to solve the problem?

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unscribe

2016-12-07 Thread smith_666


Unsubscribe

2016-12-07 Thread Roger Holenweger



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Unsubscribe

2016-12-07 Thread Prashant Singh Thakur


Best Regards,
Prashant Thakur
Work : 6046
Mobile: +91-9740266522









NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


Unsubscribe

2016-12-07 Thread Ajit Jaokar


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Unsubscribe

2016-12-07 Thread Kranthi Gmail


-- 
Kranthi

PS: Sent from mobile, pls excuse the brevity and typos.

> On Dec 7, 2016, at 8:05 PM, Siddhartha Khaitan  
> wrote:
> 
> 


Unsubscribe

2016-12-07 Thread Siddhartha Khaitan



unsubscribe

2016-12-07 Thread Ajith Jose



Re: Not per-key state in spark streaming

2016-12-07 Thread Anty Rao
On Wed, Dec 7, 2016 at 7:42 PM, Anty Rao  wrote:

> Hi
> I'm new to Spark. I'm doing some research to see if Spark Streaming can
> solve my problem. I don't want to keep per-key state, because my data set is
> very large and the state must be kept for a fairly long time, so it is not viable
> to keep all per-key state in memory. Instead, I want a bloom-filter-based state.
> Is it possible to achieve this in Spark Streaming?
>
> Is it possible to achieve this by extending Spark API?

> --
> Anty Rao
>



-- 
Anty Rao


Re: Not per-key state in spark streaming

2016-12-07 Thread Anty Rao
@Daniel
Thanks for your reply. I will try it.

On Wed, Dec 7, 2016 at 8:47 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi Anty,
> What you could do is keep in the state only the existence of a key and
> when necessary pull it from a secondary state store like HDFS or HBASE.
>
> Daniel
>
> On Wed, Dec 7, 2016 at 1:42 PM, Anty Rao  wrote:
>
>> Hi
>> I'm new to Spark. I'm doing some research to see if spark streaming can
>> solve my problem. I don't want to keep per-key state,b/c my data set is
>> very huge and keep a little longer time, it not viable to keep all per key
>> state in memory.Instead, i want to have a bloom filter based state. Does it
>> possible to achieve this in Spark streaming.
>>
>> --
>> Anty Rao
>>
>
>


-- 
Anty Rao
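
[A rough sketch of one way to get bloom-filter-based state with the DStream API, building on
Daniel's suggestion above. It assumes Spark 2.0's org.apache.spark.util.sketch.BloomFilter
(from the spark-sketch module) is on the classpath; bucket count, sizes and the socket source
are illustrative:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.util.sketch.BloomFilter

  val ssc = new StreamingContext(new SparkConf().setAppName("bloom-state"), Seconds(10))
  ssc.checkpoint("hdfs:///tmp/bloom-state")          // updateStateByKey needs checkpointing

  // Keep one BloomFilter per coarse bucket instead of per-key state.
  def update(batch: Seq[String], state: Option[BloomFilter]): Option[BloomFilter] = {
    val bf = state.getOrElse(BloomFilter.create(10000000L, 0.01))  // ~10M items, 1% fpp
    batch.foreach(s => bf.putString(s))
    Some(bf)
  }

  val keys = ssc.socketTextStream("localhost", 9999)               // illustrative source
  keys.map(k => (math.abs(k.hashCode) % 16, k))                    // 16 coarse buckets
      .updateStateByKey(update _)
      .count()
      .print()

  ssc.start()
  ssc.awaitTermination()
]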


Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Iman Mohtashemi
yes exactly. I run mine fine in Eclipse but when I run it from a
corresponding jar I get the same error!

On Wed, Dec 7, 2016 at 5:04 PM Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> I believe, it's not about the location (i.e., local machine or HDFS) but
> it's all about the format of the input file. For example, I am getting the
> following error while trying to read an input file in libsvm format:
>
> *Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find data  source: libsvm. *
>
> The application works fine on Eclipse. However, while packaging the
> corresponding jar file, I am getting the above error which is really weird!
>
>
>
> Regards,
> _
> *Md. Rezaul Karim* BSc, MSc
>
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 7 December 2016 at 23:39, Iman Mohtashemi 
> wrote:
>
> No but I tried that too and still didn't work. Where are the files being
> read from? From the local machine or HDFS? Do I need to get the files to
> HDFS first? In Eclipse I just point to the location of the directory?
>
> On Wed, Dec 7, 2016 at 3:34 PM Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
> Hi,
>
> You should prepare your jar file (from your Spark application written in
> Java) with all the necessary dependencies. You can create a Maven project
> on Eclipse by specifying the dependencies in a Maven friendly pom.xml file.
>
> For building the jar with the dependencies and *main class (since you are
> getting the **ClassNotFoundException)* your pom.xml should contain the
> following in the *build *tag (example main class is marked in Red color):
>
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-eclipse-plugin
> 2.9
> 
> true
> false
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-compiler-plugin
> 3.5.1
> 
> ${jdk.version}
> ${jdk.version}
> 
> 
> 
> org.apache.maven.plugins
> maven-shade-plugin
> 2.4.3
> 
> true
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-assembly-plugin
> 2.4.1
> 
> 
> 
>
> jar-with-dependencies
> 
> 
> 
> 
>
> com.example.RandomForest.SongPrediction
> 
> 
>
> 
>
> oozie.launcher.mapreduce.job.user.classpath.first
> true
> 
>
> 
> 
> 
> make-assembly
> 
> package
> 
> single
> 
> 
> 
> 
> 
> 
>
>
> An example pom.xml file has been attached for your reference. Feel free to
> reuse it.
>
>
> Regards,
> _
> *Md. Rezaul Karim,* BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 7 December 2016 at 23:18, im281  wrote:
>
> Hello,
> I have a simple word count example in Java and I can run this in Eclipse
> (code at the bottom)
>
> I then create a jar file from it and try to run it from the cmd
>
>
> java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt
>
> But I get this error?
>
> I think the main error is:
> *Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find
> data source: text*
>
> Any advise on how to run this jar file in spark would be appreciated
>
>
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 16/12/07 15:16:41 INFO SparkContext: Running Spark version 2.0.2
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: 

Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Md. Rezaul Karim
I believe, it's not about the location (i.e., local machine or HDFS) but
it's all about the format of the input file. For example, I am getting the
following error while trying to read an input file in libsvm format:

*Exception in thread "main" java.lang.ClassNotFoundException: Failed to
find data  source: libsvm. *

The application works fine on Eclipse. However, while packaging the
corresponding jar file, I am getting the above error which is really weird!



Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


On 7 December 2016 at 23:39, Iman Mohtashemi 
wrote:

> No but I tried that too and still didn't work. Where are the files being
> read from? From the local machine or HDFS? Do I need to get the files to
> HDFS first? In Eclipse I just point to the location of the directory?
>
> On Wed, Dec 7, 2016 at 3:34 PM Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi,
>>
>> You should prepare your jar file (from your Spark application written in
>> Java) with all the necessary dependencies. You can create a Maven project
>> on Eclipse by specifying the dependencies in a Maven friendly pom.xml file.
>>
>> For building the jar with the dependencies and *main class (since you
>> are getting the **ClassNotFoundException)* your pom.xml should contain
>> the following in the *build *tag (example main class is marked in Red
>> color):
>>
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-eclipse-plugin
>> 2.9
>> 
>> true
>> false
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-compiler-plugin
>> 3.5.1
>> 
>> ${jdk.version}
>> ${jdk.version}
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-shade-plugin
>> 2.4.3
>> 
>> true
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-assembly-plugin
>> 2.4.1
>> 
>> 
>> 
>> jar-with-
>> dependencies
>> 
>> 
>> 
>> 
>> com.example.
>> RandomForest.SongPrediction
>> 
>> 
>>
>> 
>> oozie.launcher.
>> mapreduce.job.user.classpath.first
>> true
>> 
>>
>> 
>> 
>> 
>> make-assembly
>> 
>> package
>> 
>> single
>> 
>> 
>> 
>> 
>> 
>> 
>>
>>
>> An example pom.xml file has been attached for your reference. Feel free
>> to reuse it.
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim,* BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> 
>>
>> On 7 December 2016 at 23:18, im281  wrote:
>>
>> Hello,
>> I have a simple word count example in Java and I can run this in Eclipse
>> (code at the bottom)
>>
>> I then create a jar file from it and try to run it from the cmd
>>
>>
>> java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt
>>
>> But I get this error?
>>
>> I think the main error is:
>> *Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>> find
>> data source: text*
>>
>> Any advise on how to run this jar file in spark would be appreciated
>>
>>
>> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 16/12/07 15:16:41 INFO SparkContext: Running Spark version 2.0.2
>> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls to: Owner
>> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls to: Owner
>> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls groups to:
>> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls groups to:
>> 16/12/07 15:16:42 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users  with view permissions: Set(Owner);
>> groups
>> with view 

StreamingContext.textFileStream(...)

2016-12-07 Thread muthu
Hello there,

I am trying to find a way to get the file-name of the current file being
processed from the monitored directory for HDFS...

Meaning, 

Let's say...
val lines = ssc.textFileStream("my_hdfs_location")
lines.map { (row: String) => ... } //No access to file-name here
Also, let's say I have RDDs created at times t1, t2, t3, ...
In every batch interval, I would like to perform a foldLeft()-like
operation on these DStreams as they arrive. Meaning, before t1 I would
have an initial/zero value (or load a value from some persistence store),
and after every interval t1, t2, ... I would like this value to be folded into
the accumulator (as commonly seen in Scala collections' foldLeft()).
reduceByKeyAndWindow() lets me do something similar, but I have to
perform the reduce on every iteration. With foldLeft()-like semantics I
only have to combine the previously combined result with the result from
the current time tn.

Please advise,
Muthu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/StreamingContext-textFileStream-tp28183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
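
[On the fold-like accumulation part of the question above, a minimal sketch with the DStream
API, reusing the ssc and directory from the message (key and zero value are illustrative; an
overload of updateStateByKey also accepts an initial-state RDD loaded from a store):

  // Carries a running value across intervals t1, t2, t3, ... much like foldLeft
  // over the stream of micro-batches. Requires ssc.checkpoint(...) to be set.
  val lines = ssc.textFileStream("my_hdfs_location")

  val fold: (Seq[Long], Option[Long]) => Option[Long] =
    (newValues, acc) => Some(acc.getOrElse(0L) + newValues.sum)  // 0L acts as the zero value

  lines.map(row => ("total", row.length.toLong))   // fold everything under one logical key
       .updateStateByKey(fold)
       .print()
]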



Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Iman Mohtashemi
No but I tried that too and still didn't work. Where are the files being
read from? From the local machine or HDFS? Do I need to get the files to
HDFS first? In Eclipse I just point to the location of the directory?

On Wed, Dec 7, 2016 at 3:34 PM Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> Hi,
>
> You should prepare your jar file (from your Spark application written in
> Java) with all the necessary dependencies. You can create a Maven project
> on Eclipse by specifying the dependencies in a Maven friendly pom.xml file.
>
> For building the jar with the dependencies and *main class (since you are
> getting the **ClassNotFoundException)* your pom.xml should contain the
> following in the *build *tag (example main class is marked in Red color):
>
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-eclipse-plugin
> 2.9
> 
> true
> false
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-compiler-plugin
> 3.5.1
> 
> ${jdk.version}
> ${jdk.version}
> 
> 
> 
> org.apache.maven.plugins
> maven-shade-plugin
> 2.4.3
> 
> true
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-assembly-plugin
> 2.4.1
> 
> 
> 
>
> jar-with-dependencies
> 
> 
> 
> 
>
> com.example.RandomForest.SongPrediction
> 
> 
>
> 
>
> oozie.launcher.mapreduce.job.user.classpath.first
> true
> 
>
> 
> 
> 
> make-assembly
> 
> package
> 
> single
> 
> 
> 
> 
> 
> 
>
>
> An example pom.xml file has been attached for your reference. Feel free to
> reuse it.
>
>
> Regards,
> _
> *Md. Rezaul Karim,* BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 7 December 2016 at 23:18, im281  wrote:
>
> Hello,
> I have a simple word count example in Java and I can run this in Eclipse
> (code at the bottom)
>
> I then create a jar file from it and try to run it from the cmd
>
>
> java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt
>
> But I get this error?
>
> I think the main error is:
> *Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find
> data source: text*
>
> Any advise on how to run this jar file in spark would be appreciated
>
>
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 16/12/07 15:16:41 INFO SparkContext: Running Spark version 2.0.2
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(Owner);
> groups
> with view permissions: Set(); users  with modify permissions: Set(Owner);
> groups with modify permissions: Set()
> 16/12/07 15:16:44 INFO Utils: Successfully started service 'sparkDriver' on
> port 10211.
> 16/12/07 15:16:44 INFO SparkEnv: Registering MapOutputTracker
> 16/12/07 15:16:44 INFO SparkEnv: Registering BlockManagerMaster
> 16/12/07 15:16:44 INFO DiskBlockManager: Created local directory at
>
> C:\Users\Owner\AppData\Local\Temp\blockmgr-b4b1960b-08fc-44fd-a75e-1a0450556873
> 16/12/07 15:16:44 INFO MemoryStore: MemoryStore started with capacity
> 1984.5
> MB
> 16/12/07 15:16:45 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/12/07 15:16:45 INFO Utils: Successfully started service 'SparkUI' on
> port
> 4040.
> 16/12/07 15:16:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
> http://192.168.19.2:4040
> 16/12/07 15:16:45 INFO Executor: Starting executor ID driver on host
> localhost
> 16/12/07 15:16:45 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 10252.
> 16/12/07 

Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Md. Rezaul Karim
Hi,

You should prepare your jar file (from your Spark application written in
Java) with all the necessary dependencies. You can create a Maven project
on Eclipse by specifying the dependencies in a Maven friendly pom.xml file.

For building the jar with the dependencies and *main class (since you are
getting the **ClassNotFoundException)* your pom.xml should contain the
following in the *build *tag (example main class is marked in Red color):





org.apache.maven.plugins
maven-eclipse-plugin
2.9

true
false




org.apache.maven.plugins
maven-compiler-plugin
3.5.1

${jdk.version}
${jdk.version}



org.apache.maven.plugins
maven-shade-plugin
2.4.3

true




org.apache.maven.plugins
maven-assembly-plugin
2.4.1



jar-with-dependencies





com.example.RandomForest.SongPrediction





oozie.launcher.mapreduce.job.user.classpath.first
true





make-assembly

package

single








An example pom.xml file has been attached for your reference. Feel free to
reuse it.


Regards,
_
*Md. Rezaul Karim,* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
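
[One packaging detail worth adding here, since the reported error is about the built-in "text"
data source: when many jars are merged into one, their META-INF/services files (including
org.apache.spark.sql.sources.DataSourceRegister, through which Spark finds its built-in
sources) can overwrite one another, which produces exactly this kind of "Failed to find data
source" error. With the maven-shade-plugin this is commonly addressed by merging the service
files; a sketch of the relevant configuration (main class is illustrative):

  <configuration>
    <transformers>
      <!-- Merge META-INF/services entries instead of letting one jar overwrite another -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
      <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
        <mainClass>JavaWordCount</mainClass>
      </transformer>
    </transformers>
  </configuration>
]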


On 7 December 2016 at 23:18, im281  wrote:

> Hello,
> I have a simple word count example in Java and I can run this in Eclipse
> (code at the bottom)
>
> I then create a jar file from it and try to run it from the cmd
>
>
> java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt
>
> But I get this error?
>
> I think the main error is:
> *Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find
> data source: text*
>
> Any advise on how to run this jar file in spark would be appreciated
>
>
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 16/12/07 15:16:41 INFO SparkContext: Running Spark version 2.0.2
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(Owner);
> groups
> with view permissions: Set(); users  with modify permissions: Set(Owner);
> groups with modify permissions: Set()
> 16/12/07 15:16:44 INFO Utils: Successfully started service 'sparkDriver' on
> port 10211.
> 16/12/07 15:16:44 INFO SparkEnv: Registering MapOutputTracker
> 16/12/07 15:16:44 INFO SparkEnv: Registering BlockManagerMaster
> 16/12/07 15:16:44 INFO DiskBlockManager: Created local directory at
> C:\Users\Owner\AppData\Local\Temp\blockmgr-b4b1960b-08fc-
> 44fd-a75e-1a0450556873
> 16/12/07 15:16:44 INFO MemoryStore: MemoryStore started with capacity
> 1984.5
> MB
> 16/12/07 15:16:45 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/12/07 15:16:45 INFO Utils: Successfully started service 'SparkUI' on
> port
> 4040.
> 16/12/07 15:16:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
> http://192.168.19.2:4040
> 16/12/07 15:16:45 INFO Executor: Starting executor ID driver on host
> localhost
> 16/12/07 15:16:45 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 10252.
> 16/12/07 15:16:45 INFO NettyBlockTransferService: Server created on
> 192.168.19.2:10252
> 16/12/07 15:16:45 INFO BlockManagerMaster: Registering BlockManager
> BlockManagerId(driver, 192.168.19.2, 10252)
> 16/12/07 15:16:45 INFO BlockManagerMasterEndpoint: Registering block
> manager
> 192.168.19.2:10252 with 1984.5 MB RAM, BlockManagerId(driver,
> 192.168.19.2,
> 10252)
> 16/12/07 15:16:45 INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, 

Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Gmail
Don't you need to provide your class name "JavaWordCount"?

Thanks,
Vasu. 

> On Dec 7, 2016, at 3:18 PM, im281  wrote:
> 
> Hello,
> I have a simple word count example in Java and I can run this in Eclipse
> (code at the bottom)
> 
> I then create a jar file from it and try to run it from the cmd
> 
> 
> java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt
> 
> But I get this error?
> 
> I think the main error is:
> *Exception in thread "main" java.lang.ClassNotFoundException: Failed to find
> data source: text*
> 
> Any advise on how to run this jar file in spark would be appreciated
> 
> 
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 16/12/07 15:16:41 INFO SparkContext: Running Spark version 2.0.2
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(Owner); groups
> with view permissions: Set(); users  with modify permissions: Set(Owner);
> groups with modify permissions: Set()
> 16/12/07 15:16:44 INFO Utils: Successfully started service 'sparkDriver' on
> port 10211.
> 16/12/07 15:16:44 INFO SparkEnv: Registering MapOutputTracker
> 16/12/07 15:16:44 INFO SparkEnv: Registering BlockManagerMaster
> 16/12/07 15:16:44 INFO DiskBlockManager: Created local directory at
> C:\Users\Owner\AppData\Local\Temp\blockmgr-b4b1960b-08fc-44fd-a75e-1a0450556873
> 16/12/07 15:16:44 INFO MemoryStore: MemoryStore started with capacity 1984.5
> MB
> 16/12/07 15:16:45 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/12/07 15:16:45 INFO Utils: Successfully started service 'SparkUI' on port
> 4040.
> 16/12/07 15:16:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
> http://192.168.19.2:4040
> 16/12/07 15:16:45 INFO Executor: Starting executor ID driver on host
> localhost
> 16/12/07 15:16:45 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 10252.
> 16/12/07 15:16:45 INFO NettyBlockTransferService: Server created on
> 192.168.19.2:10252
> 16/12/07 15:16:45 INFO BlockManagerMaster: Registering BlockManager
> BlockManagerId(driver, 192.168.19.2, 10252)
> 16/12/07 15:16:45 INFO BlockManagerMasterEndpoint: Registering block manager
> 192.168.19.2:10252 with 1984.5 MB RAM, BlockManagerId(driver, 192.168.19.2,
> 10252)
> 16/12/07 15:16:45 INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, 192.168.19.2, 10252)
> 16/12/07 15:16:46 WARN SparkContext: Use an existing SparkContext, some
> configuration may not take effect.
> 16/12/07 15:16:46 INFO SharedState: Warehouse path is
> 'file:/C:/Users/Owner/spark-warehouse'.
> Exception in thread "main" java.lang.ClassNotFoundException: Failed to find
> data source: text. Please find packages at
> https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
>at
> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
>at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
>at
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
>at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
>at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>at
> org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:504)
>at
> org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:540)
>at
> org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:513)
>at JavaWordCount.main(JavaWordCount.java:57)
> Caused by: java.lang.ClassNotFoundException: text.DefaultSource
>at java.net.URLClassLoader.findClass(Unknown Source)
>at java.lang.ClassLoader.loadClass(Unknown Source)
>at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
>at java.lang.ClassLoader.loadClass(Unknown Source)
>at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
>at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
>at scala.util.Try$.apply(Try.scala:192)
>at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
>at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
>at scala.util.Try.orElse(Try.scala:84)
>at
> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:132)
>

Running spark from Eclipse and then Jar

2016-12-07 Thread im281
Hello,
I have a simple word count example in Java and I can run this in Eclipse
(code at the bottom)

I then create a jar file from it and try to run it from the cmd


java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt

But I get this error?

I think the main error is:
*Exception in thread "main" java.lang.ClassNotFoundException: Failed to find
data source: text*

Any advice on how to run this jar file in Spark would be appreciated


Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
16/12/07 15:16:41 INFO SparkContext: Running Spark version 2.0.2
16/12/07 15:16:42 INFO SecurityManager: Changing view acls to: Owner
16/12/07 15:16:42 INFO SecurityManager: Changing modify acls to: Owner
16/12/07 15:16:42 INFO SecurityManager: Changing view acls groups to:
16/12/07 15:16:42 INFO SecurityManager: Changing modify acls groups to:
16/12/07 15:16:42 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(Owner); groups
with view permissions: Set(); users  with modify permissions: Set(Owner);
groups with modify permissions: Set()
16/12/07 15:16:44 INFO Utils: Successfully started service 'sparkDriver' on
port 10211.
16/12/07 15:16:44 INFO SparkEnv: Registering MapOutputTracker
16/12/07 15:16:44 INFO SparkEnv: Registering BlockManagerMaster
16/12/07 15:16:44 INFO DiskBlockManager: Created local directory at
C:\Users\Owner\AppData\Local\Temp\blockmgr-b4b1960b-08fc-44fd-a75e-1a0450556873
16/12/07 15:16:44 INFO MemoryStore: MemoryStore started with capacity 1984.5
MB
16/12/07 15:16:45 INFO SparkEnv: Registering OutputCommitCoordinator
16/12/07 15:16:45 INFO Utils: Successfully started service 'SparkUI' on port
4040.
16/12/07 15:16:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://192.168.19.2:4040
16/12/07 15:16:45 INFO Executor: Starting executor ID driver on host
localhost
16/12/07 15:16:45 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 10252.
16/12/07 15:16:45 INFO NettyBlockTransferService: Server created on
192.168.19.2:10252
16/12/07 15:16:45 INFO BlockManagerMaster: Registering BlockManager
BlockManagerId(driver, 192.168.19.2, 10252)
16/12/07 15:16:45 INFO BlockManagerMasterEndpoint: Registering block manager
192.168.19.2:10252 with 1984.5 MB RAM, BlockManagerId(driver, 192.168.19.2,
10252)
16/12/07 15:16:45 INFO BlockManagerMaster: Registered BlockManager
BlockManagerId(driver, 192.168.19.2, 10252)
16/12/07 15:16:46 WARN SparkContext: Use an existing SparkContext, some
configuration may not take effect.
16/12/07 15:16:46 INFO SharedState: Warehouse path is
'file:/C:/Users/Owner/spark-warehouse'.
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find
data source: text. Please find packages at
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
at
org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
at
org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
at
org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at
org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:504)
at
org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:540)
at
org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:513)
at JavaWordCount.main(JavaWordCount.java:57)
Caused by: java.lang.ClassNotFoundException: text.DefaultSource
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at scala.util.Try$.apply(Try.scala:192)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
at scala.util.Try.orElse(Try.scala:84)
at
org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:132)
... 8 more
16/12/07 15:16:46 INFO SparkContext: Invoking stop() from shutdown hook
16/12/07 15:16:46 INFO SparkUI: Stopped Spark web UI at
http://192.168.19.2:4040
16/12/07 15:16:46 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
16/12/07 15:16:46 INFO MemoryStore: MemoryStore cleared
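
[A hedged note on launching the jar above: Spark applications are normally started with
spark-submit rather than plain java -jar, which also puts Spark's own jars (and their built-in
data sources) on the classpath; a sketch using the class name from the stack trace:

  spark-submit --class JavaWordCount --master "local[*]" C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt
]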

Pruning decision tree to create an optimal tree

2016-12-07 Thread Md. Rezaul Karim
Hi there,

Say I have a deep tree that needs to be pruned to create an optimal
tree. In R, for example, this can be done using the *rpart* and *prune* functions.


Is it possible to prune the MLlib-based decision tree while performing the
classification or regression?




Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



Accessing classpath resources in Spark Shell

2016-12-07 Thread Michal Šenkýř

Hello everyone,

I recently encountered a situation where I needed to add a custom 
classpath resource to my driver and access it from an included library 
(specifically a configuration file for a custom Dataframe Reader).


I need to use it both inside an application which I submit to the 
cluster using spark-submit --deploy-mode cluster as well as from the 
Spark shell when doing computations manually.


Adding it to the application was easy. I just distributed the file 
throughout the cluster using --files and added it to 
the driver's classpath using --driver-class-path. The file 
would then be obtainable using, for example, 
Thread.currentThread.getContextClassLoader.getResourceAsStream(...).


The problem was when I wanted to do the same thing in the Spark shell. I 
tried adding the file to the classpath the same way, as well as with 
--driver-class-path, etc., but I cannot access it in 
any way, either from the library or directly from the shell.


How do the Spark shell's classloading facilities work, and how should I 
solve this problem?


Thanks,
Michal
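
[For the application path described above, a minimal sketch of reading such a classpath
resource (the resource name reader.conf is illustrative):

  // Reads a file shipped with --files and put on the driver's classpath
  // via --driver-class-path, using the context classloader as in the message.
  val in = Thread.currentThread.getContextClassLoader.getResourceAsStream("reader.conf")
  require(in != null, "reader.conf not found on the classpath")
  val config = scala.io.Source.fromInputStream(in).mkString
  in.close()
]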



Driver/Executor Memory values during Unit Testing

2016-12-07 Thread Aleksander Eskilson
Hi there,

I've been trying to increase the spark.driver.memory and
spark.executor.memory during some unit tests. Most of the information I can
find about increasing memory for Spark is based on either flags to
spark-submit, or settings in the spark-defaults.conf file. Running unit
tests with Maven on both a local machine and on a Jenkins box, and editing
both the .conf file, and attempting to set the spark.driver.memory and
spark.executor.memory variables in a SparkConf object in the unit tests
@BeforeClass method, I still can't seem to change what the Storage Memory
of Executor is, it remains the same upon every execution when I check the
UI during the tests. When Spark is invoked on a local machine, and not
through spark-submit in the shell (as during unit test), are the memory
defaults computed some other way, perhaps based on JVM heap allocation
settings?

Best,
Alek


Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Marcelo Vanzin
Have you removed all the code dealing with Kerberos that you posted?
You should not be setting those principal / keytab configs.

Literally all you have to do is login with kinit then run spark-submit.

Try with the SparkPi example for instance, instead of your own code.
If that doesn't work, you have a configuration issue somewhere.

On Wed, Dec 7, 2016 at 1:09 PM, Gerard Casey  wrote:
> Thanks.
>
> I’ve checked the TGT, principal and key tab. Where to next?!
>
>> On 7 Dec 2016, at 22:03, Marcelo Vanzin  wrote:
>>
>> On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey  
>> wrote:
>>> Can anyone point me to a tutorial or a run through of how to use Spark with
>>> Kerberos? This is proving to be quite confusing. Most search results on the
>>> topic point to what needs inputted at the point of `sparks submit` and not
>>> the changes needed in the actual src/main/.scala file
>>
>> You don't need to write any special code to run Spark with Kerberos.
>> Just write your application normally, and make sure you're logged in
>> to the KDC (i.e. "klist" shows a valid TGT) before running your app.
>>
>>
>> --
>> Marcelo
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
Thanks.

I’ve checked the TGT, principal and key tab. Where to next?! 

> On 7 Dec 2016, at 22:03, Marcelo Vanzin  wrote:
> 
> On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey  
> wrote:
>> Can anyone point me to a tutorial or a run through of how to use Spark with
>> Kerberos? This is proving to be quite confusing. Most search results on the
>> topic point to what needs inputted at the point of `sparks submit` and not
>> the changes needed in the actual src/main/.scala file
> 
> You don't need to write any special code to run Spark with Kerberos.
> Just write your application normally, and make sure you're logged in
> to the KDC (i.e. "klist" shows a valid TGT) before running your app.
> 
> 
> -- 
> Marcelo


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Marcelo Vanzin
On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey  wrote:
> Can anyone point me to a tutorial or a run through of how to use Spark with
> Kerberos? This is proving to be quite confusing. Most search results on the
> topic point to what needs inputted at the point of `sparks submit` and not
> the changes needed in the actual src/main/.scala file

You don't need to write any special code to run Spark with Kerberos.
Just write your application normally, and make sure you're logged in
to the KDC (i.e. "klist" shows a valid TGT) before running your app.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
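
[A minimal sketch of the kinit-then-submit flow described above (principal, realm and jar path
are illustrative; SparkPi is the bundled example). For long-running jobs, --principal and
--keytab can be added so Spark can renew the ticket, as noted elsewhere in the thread:

  kinit someuser@EXAMPLE.COM
  spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.2.jar 100
]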



Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
If your operations are idempotent, you should be able to just run a
totally separate job that looks for failed batches and does a kafkaRDD
to reprocess that batch.  C* probably isn't the first choice for what
is essentially a queue, but if the frequency of batches is relatively
low it probably doesn't matter.

That is indeed a weird stacktrace, did you investigate driver logs to
see if there was something else preceding it?
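
[A rough sketch of that separate kafkaRDD reprocessing job, assuming the
spark-streaming-kafka-0-8 connector mentioned later in the thread and that
topic/partition/offset ranges come back out of the C* table (broker list, topic and offsets
are illustrative):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  // sc is an existing SparkContext; offsets come from the per-batch bookkeeping table.
  val ranges = Array(
    OffsetRange("events", 0, 12000L, 15000L),
    OffsetRange("events", 1, 11800L, 14700L))
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

  val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, ranges)

  rdd.foreach { case (key, value) =>
    // idempotent reprocessing of the record goes here
    println(s"reprocessing key=$key")
  }
]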

On Wed, Dec 7, 2016 at 2:41 PM, map reduced  wrote:
>> Personally I think forcing the stream to fail (e.g. check offsets in
>> downstream store and throw exception if they aren't as expected) is
>> the safest thing to do.
>
>
> I would think so too, but just for say 2-3 (sometimes just 1) failed batches
> in a whole day, I am trying to not kill the whole processing and restart.
>
> I am storing the offsets per batch and success/failure in a separate C*
> table - checkpointing was not an option due to it not working with
> application jar change etc.  Since I have access to the offsets, you think
> #2 or some variation of it may work?
>
> Btw, some of those failures I mentioned are strange, for instance (Spark
> 2.0.0 and spark-streaming-kafka-0-8_2.11):
>
> Job aborted due to stage failure: Task 173 in stage 92312.0 failed 10 times,
> most recent failure: Lost task 173.9 in stage 92312.0 (TID 27689025,
> 17.162.114.161): java.util.NoSuchElementException
>   at
> java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
>   at
> com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:102)
>   at
> com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:81)
>   at com.yammer.metrics.core.Histogram.update(Histogram.java:110)
>   at com.yammer.metrics.core.Timer.update(Timer.java:198)
>   at com.yammer.metrics.core.Timer.update(Timer.java:76)
>   at com.yammer.metrics.core.TimerContext.stop(TimerContext.java:31)
>   at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:36)
>   at
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:111)
>   at
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
>   at
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
>   at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>   at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:110)
>   at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
>   at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:209)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
>
> On Wed, Dec 7, 2016 at 12:16 PM, Cody Koeninger  wrote:
>>
>> Personally I think forcing the stream to fail (e.g. check offsets in
>> downstream store and throw exception if they aren't as expected) is
>> the safest thing to do.
>>
>> If you proceed after a failure, you need a place to reliably record
>> the batches that failed for later processing.
>>
>> On Wed, Dec 7, 2016 at 1:46 PM, map reduced  wrote:
>> > Hi,
>> >
>> > I am trying to solve this problem - in my streaming flow, every day few
>> > jobs
>> > fail due to some (say kafka cluster maintenance etc, mostly unavoidable)
>> > reasons for few batches and resumes back to success.
>> > I want to reprocess those failed jobs programmatically (assume I have a
>> > way
>> > of getting start-end offsets for kafka topics for failed jobs). I was
>> > thinking of these options:
>> > 1) Somehow pause streaming job when it detects failing jobs - this seems
>> > not
>> > possible.
>> > 2) From driver - run additional processing to check every few minutes
>> > using
>> > driver rest api (/api/v1/applications...) what jobs have failed and
>> > submit
>> > batch jobs for those failed jobs
>> >
>> > 1 - doesn't seem to be possible, and I don't want to kill streaming
>> > context
>> > just for few failing batches to stop the job for some time and resume
>> > after
>> > few minutes.
>> > 2 - seems like a viable 

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread map reduced
>
> Personally I think forcing the stream to fail (e.g. check offsets in
> downstream store and throw exception if they aren't as expected) is
> the safest thing to do.


I would think so too, but just for say 2-3 (sometimes just 1) failed
batches in a whole day, I am trying to not kill the whole processing and
restart.

I am storing the offsets per batch and success/failure in a separate C*
table - checkpointing was not an option due to it not working with
application jar change etc.  Since I have access to the offsets, you think
#2 or some variation of it may work?

Btw, some of those failures I mentioned are strange, for instance (Spark
2.0.0 and spark-streaming-kafka-0-8_2.11):

Job aborted due to stage failure: Task 173 in stage 92312.0 failed 10
times, most recent failure: Lost task 173.9 in stage 92312.0 (TID
27689025, 17.162.114.161): java.util.NoSuchElementException
at 
java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
at 
com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:102)
at 
com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:81)
at com.yammer.metrics.core.Histogram.update(Histogram.java:110)
at com.yammer.metrics.core.Timer.update(Timer.java:198)
at com.yammer.metrics.core.Timer.update(Timer.java:76)
at com.yammer.metrics.core.TimerContext.stop(TimerContext.java:31)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:36)
at 
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:111)
at 
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
at 
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:110)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:209)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


On Wed, Dec 7, 2016 at 12:16 PM, Cody Koeninger  wrote:

> Personally I think forcing the stream to fail (e.g. check offsets in
> downstream store and throw exception if they aren't as expected) is
> the safest thing to do.
>
> If you proceed after a failure, you need a place to reliably record
> the batches that failed for later processing.
>
> On Wed, Dec 7, 2016 at 1:46 PM, map reduced  wrote:
> > Hi,
> >
> > I am trying to solve this problem - in my streaming flow, every day few
> jobs
> > fail due to some (say kafka cluster maintenance etc, mostly unavoidable)
> > reasons for few batches and resumes back to success.
> > I want to reprocess those failed jobs programmatically (assume I have a
> way
> > of getting start-end offsets for kafka topics for failed jobs). I was
> > thinking of these options:
> > 1) Somehow pause streaming job when it detects failing jobs - this seems
> not
> > possible.
> > 2) From driver - run additional processing to check every few minutes
> using
> > driver rest api (/api/v1/applications...) what jobs have failed and
> submit
> > batch jobs for those failed jobs
> >
> > 1 - doesn't seem to be possible, and I don't want to kill streaming
> context
> > just for few failing batches to stop the job for some time and resume
> after
> > few minutes.
> > 2 - seems like a viable option, but a little complicated, since even the
> > batch job can fail due to whatever reasons and I am back to tracking that
> > separately etc.
> >
> > Does anyone has faced this issue or have any suggestions?
> >
> > Thanks,
> > KP
>


Re: Spark streaming completed batches statistics

2016-12-07 Thread map reduced
Just keep in mind that the REST API needs to be called on the driver UI
endpoint and not on the Spark Master UI.
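
For illustration, a small Scala sketch of hitting that endpoint (the hostname is
a placeholder and 4040 is only the default driver UI port; adjust to your setup):

import scala.io.Source

// The monitoring REST API is served by the running driver's UI, not the master.
val base = "http://driver-host:4040/api/v1"
val apps = Source.fromURL(s"$base/applications").mkString   // applications listing as JSON
println(apps)
// With an app id from that listing, per-job details come from the /jobs resource, e.g.
// Source.fromURL(s"$base/applications/<app-id>/jobs").mkString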

On Wed, Dec 7, 2016 at 12:03 PM, Richard Startin  wrote:

> Ok it looks like I could reconstruct the logic in the Spark UI from the
> /jobs resource. Thanks.
>
>
> https://richardstartin.com/
>
>
> --
> *From:* map reduced 
> *Sent:* 07 December 2016 19:49
> *To:* Richard Startin
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark streaming completed batches statistics
>
> Have you checked http://spark.apache.org/docs/latest/monitoring.
> html#rest-api ?
>
> KP
>
> On Wed, Dec 7, 2016 at 11:43 AM, Richard Startin <
> richardstar...@outlook.com> wrote:
>
>> Is there any way to get this information as CSV/JSON?
>>
>>
>> https://docs.databricks.com/_images/CompletedBatches.png
>>
>>
>> https://richardstartin.com/
>>
>>
>> --
>> *From:* Richard Startin 
>> *Sent:* 05 December 2016 15:55
>> *To:* user@spark.apache.org
>> *Subject:* Spark streaming completed batches statistics
>>
>> Is there any way to get a more computer friendly version of the completes
>> batches section of the streaming page of the application master? I am very
>> interested in the statistics and am currently screen-scraping...
>>
>> https://richardstartin.com
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
Personally I think forcing the stream to fail (e.g. check offsets in
downstream store and throw exception if they aren't as expected) is
the safest thing to do.

If you proceed after a failure, you need a place to reliably record
the batches that failed for later processing.

On Wed, Dec 7, 2016 at 1:46 PM, map reduced  wrote:
> Hi,
>
> I am trying to solve this problem - in my streaming flow, every day few jobs
> fail due to some (say kafka cluster maintenance etc, mostly unavoidable)
> reasons for few batches and resumes back to success.
> I want to reprocess those failed jobs programmatically (assume I have a way
> of getting start-end offsets for kafka topics for failed jobs). I was
> thinking of these options:
> 1) Somehow pause streaming job when it detects failing jobs - this seems not
> possible.
> 2) From driver - run additional processing to check every few minutes using
> driver rest api (/api/v1/applications...) what jobs have failed and submit
> batch jobs for those failed jobs
>
> 1 - doesn't seem to be possible, and I don't want to kill streaming context
> just for few failing batches to stop the job for some time and resume after
> few minutes.
> 2 - seems like a viable option, but a little complicated, since even the
> batch job can fail due to whatever reasons and I am back to tracking that
> separately etc.
>
> Does anyone has faced this issue or have any suggestions?
>
> Thanks,
> KP

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
Thanks Marcelo,

It turns out I had missed setup steps in the actual file itself. Thanks to Richard
for the help here; he pointed me to some Java implementations.

I'm using the org.apache.hadoop.security API.

I now have:

/* graphx_sp.scala */
import scala.util.Try
import scala.io.Source
import scala.util.parsing.json._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.hadoop.security.UserGroupInformation

object graphx_sp {
  def main(args: Array[String]){
    // Settings
    val conf = new SparkConf().setAppName("graphx_sp")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sc.setLogLevel("WARN")

    val principal = conf.get("spark.yarn.principal")
    val keytab = conf.get("spark.yarn.keytab")
    val loginUser = UserGroupInformation.loginUserFromKeytab(principal, keytab)

    UserGroupInformation.getLoginUser(loginUser)
    ## Actual code….

Running sbt returns:

src/main/scala/graphx_sp.scala:35: too many arguments for method getLoginUser: 
()org.apache.hadoop.security.UserGroupInformation
[error] UserGroupInformation.getLoginUser(loginUser)
[error]  ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

The docs show that there should be two inputs, the principal and key tab. See here.

Can anyone point me to a tutorial or a run-through of how to use Spark with 
Kerberos? This is proving to be quite confusing. Most search results on the 
topic cover what needs to be passed at the point of `spark-submit` and not the 
changes needed in the actual src/main/scala source file.
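
For what it's worth, here is a minimal sketch of those calls based purely on the
signature shown in the compile error above (getLoginUser() takes no arguments;
loginUserFromKeytab(principal, keytab) is what actually performs the keytab login
and returns Unit). As noted elsewhere in this thread, none of this should be
needed if you simply kinit before spark-submit:

import org.apache.hadoop.security.UserGroupInformation

// placeholder values; in the code above these come from the SparkConf
val principal = "me@REALM"
val keytab = "/path/to/keytab"

UserGroupInformation.loginUserFromKeytab(principal, keytab)
val ugi = UserGroupInformation.getLoginUser()
println(s"Logged in as ${ugi.getUserName}")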

Gerry

> On 5 Dec 2016, at 19:45, Marcelo Vanzin  wrote:
> 
> That's not the error, that's just telling you the application failed.
> You have to look at the YARN logs for application_1479877553404_0041
> to see why it failed.
> 
> On Mon, Dec 5, 2016 at 10:44 AM, Gerard Casey  
> wrote:
>> Thanks Marcelo,
>> 
>> My understanding from a few pointers is that this may be due to insufficient 
>> read permissions to the key tab or a corrupt key tab. I have checked the 
>> read permissions and they are ok. I can see that it is initially configuring 
>> correctly:
>> 
>>   INFO security.UserGroupInformation: Login successful for user 
>> user@login_node using keytab file /path/to/keytab
>> 
>> I’ve added the full trace below.
>> 
>> Gerry
>> 
>> Full trace:
>> 
>> Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
>> Spark1 will be picked by default
>> 16/12/05 18:23:27 WARN util.NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> 16/12/05 18:23:27 INFO security.UserGroupInformation: Login successful for 
>> user me@login_node using keytab file /path/to/keytab
>> 16/12/05 18:23:27 INFO yarn.Client: Attempting to login to the Kerberos 
>> using principal: me@login_node and keytab: /path/to/keytab
>> 16/12/05 18:23:28 INFO impl.TimelineClientImpl: Timeline service address: 
>> http://login_node1.xcat.cluster:8188/ws/v1/timeline/
>> 16/12/05 18:23:28 INFO client.RMProxy: Connecting to ResourceManager at 
>> login_node1.xcat.cluster/
>> 16/12/05 18:23:28 INFO client.AHSProxy: Connecting to Application History 
>> server at login_node1.xcat.cluster/
>> 16/12/05 18:23:28 WARN shortcircuit.DomainSocketFactory: The short-circuit 
>> local reads feature cannot be used because libhadoop cannot be loaded.
>> 16/12/05 18:23:28 INFO yarn.Client: Requesting a new application from 
>> cluster with 32 NodeManagers
>> 16/12/05 18:23:28 INFO yarn.Client: Verifying our application has not 
>> requested more than the maximum memory capability of the cluster (15360 MB 
>> per container)
>> 16/12/05 18:23:28 INFO yarn.Client: Will allocate AM container, with 1408 MB 
>> memory including 384 MB overhead
>> 16/12/05 18:23:28 INFO yarn.Client: Setting up container launch context for 
>> our AM
>> 16/12/05 18:23:28 INFO yarn.Client: Setting up the launch environment for 
>> our AM container
>> 16/12/05 18:23:28 INFO yarn.Client: Using the spark assembly jar on HDFS 
>> because you are using HDP, 
>> defaultSparkAssembly:hdfs://login_node1.xcat.cluster:8020/hdp/apps/2.5.0.0-1245/spark/spark-hdp-assembly.jar
>> 16/12/05 18:23:28 INFO yarn.Client: Credentials file set to:
>> 16/12/05 18:23:28 INFO yarn.YarnSparkHadoopUtil: getting token for namenode: 
>> hdfs://login_node1.xcat.cluster:8020/user/me/.sparkStaging/application_
>> 16/12/05 18:23:28 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
>> 1856 for me 

Re: Spark streaming completed batches statistics

2016-12-07 Thread Richard Startin
Ok it looks like I could reconstruct the logic in the Spark UI from the /jobs 
resource. Thanks.


https://richardstartin.com/



From: map reduced 
Sent: 07 December 2016 19:49
To: Richard Startin
Cc: user@spark.apache.org
Subject: Re: Spark streaming completed batches statistics

Have you checked http://spark.apache.org/docs/latest/monitoring.html#rest-api ?

KP

On Wed, Dec 7, 2016 at 11:43 AM, Richard Startin 
> wrote:

Is there any way to get this information as CSV/JSON?


https://docs.databricks.com/_images/CompletedBatches.png



https://richardstartin.com/



From: Richard Startin 
>
Sent: 05 December 2016 15:55
To: user@spark.apache.org
Subject: Spark streaming completed batches statistics

Is there any way to get a more computer-friendly version of the completed 
batches section of the streaming page of the application master? I am very 
interested in the statistics and am currently screen-scraping...

https://richardstartin.com
-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org




Re: Spark streaming completed batches statistics

2016-12-07 Thread map reduced
Have you checked
http://spark.apache.org/docs/latest/monitoring.html#rest-api ?

KP

On Wed, Dec 7, 2016 at 11:43 AM, Richard Startin  wrote:

> Is there any way to get this information as CSV/JSON?
>
>
> https://docs.databricks.com/_images/CompletedBatches.png
>
>
> https://richardstartin.com/
>
>
> --
> *From:* Richard Startin 
> *Sent:* 05 December 2016 15:55
> *To:* user@spark.apache.org
> *Subject:* Spark streaming completed batches statistics
>
> Is there any way to get a more computer friendly version of the completes
> batches section of the streaming page of the application master? I am very
> interested in the statistics and am currently screen-scraping...
>
> https://richardstartin.com
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Reprocessing failed jobs in Streaming job

2016-12-07 Thread map reduced
Hi,

I am trying to solve this problem - in my streaming flow, every day a few
jobs fail for a few batches due to (mostly unavoidable) reasons such as Kafka
cluster maintenance, and then the stream resumes successfully.
I want to reprocess those failed jobs programmatically (assume I have a way
of getting start-end offsets for kafka topics for failed jobs). I was
thinking of these options:
1) Somehow pause the streaming job when it detects failing jobs - this seems
not to be possible.
2) From driver - run additional processing to check every few minutes using
driver rest api (/api/v1/applications...) what jobs have failed and submit
batch jobs for those failed jobs

1 - doesn't seem to be possible, and I don't want to kill the streaming context
just to stop the job for some time for a few failing batches and resume after a
few minutes.
2 - seems like a viable option, but a little complicated, since even the
batch job can fail due to whatever reasons and I am back to tracking that
separately etc.

Has anyone faced this issue, or does anyone have any suggestions?

Thanks,
KP


Re: Spark streaming completed batches statistics

2016-12-07 Thread Richard Startin
Is there any way to get this information as CSV/JSON?


https://docs.databricks.com/_images/CompletedBatches.png



https://richardstartin.com/



From: Richard Startin 
Sent: 05 December 2016 15:55
To: user@spark.apache.org
Subject: Spark streaming completed batches statistics

Is there any way to get a more computer-friendly version of the completed 
batches section of the streaming page of the application master? I am very 
interested in the statistics and am currently screen-scraping...

https://richardstartin.com
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



create new spark context from ipython or jupyter

2016-12-07 Thread pseudo oduesp
Hi,
How can we create a new SparkContext from an IPython or Jupyter session?
I mean, if I use the current SparkContext and run sc.stop(),

how can I launch a new one from IPython without restarting the IPython session
by refreshing the browser?

This comes up when I write some functions and then figure out I forgot something
inside a function; I make that function available in one of two ways:


   simply adding the module with
  sys.path.append(pathofmodule)
 or with sc.addPyFile (if I need it applied on each worker)

Also, can someone explain how I can write unit tests in PySpark?
Should I run them locally or on the cluster, and how do I do that in PySpark?

Thank you in advance.


filter RDD by variable

2016-12-07 Thread Soheila S.
Hi
I am new to Spark and have a question about the first steps of learning Spark.

How can I filter an RDD using a String variable (for example words[i]),
instead of a fixed one like "Error"?

Thanks a lot in advance.
Soheila
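
A minimal Scala sketch of this, assuming `words` is an Array[String] on the
driver and `i` is a valid index; the sample data and app name are made up:

import org.apache.spark.{SparkConf, SparkContext}

object FilterByVariable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-by-variable").setMaster("local[*]"))
    val lines = sc.parallelize(Seq("INFO ok", "ERROR disk full", "WARN slow"))
    val words = Array("ERROR", "WARN")
    val i = 0
    // The filter predicate can close over any driver-side String variable,
    // exactly as it would over a literal such as "Error".
    val matched = lines.filter(line => line.contains(words(i)))
    matched.collect().foreach(println)
    sc.stop()
  }
}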


Re: Not per-key state in spark streaming

2016-12-07 Thread Daniel Haviv
Hi Anty,
What you could do is keep only the existence of a key in the state, and when
necessary pull the full value from a secondary store like HDFS or HBase
(see the sketch below).

Daniel
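
A minimal sketch of that "existence only" state in Scala, assuming a keyed
DStream[(String, Int)] named `events` (mapWithState also requires checkpointing
to be enabled on the StreamingContext); the fallback to HDFS/HBase is only noted
as a comment:

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

def markSeen(events: DStream[(String, Int)]): DStream[(String, Boolean)] = {
  val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Boolean]) =>
    val seenBefore = state.exists()
    if (!seenBefore) state.update(true)   // remember only that the key was seen, not its value
    // on a miss, this is where you would consult the secondary store (HDFS/HBase)
    (key, seenBefore)
  }
  events.mapWithState(spec)
}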

On Wed, Dec 7, 2016 at 1:42 PM, Anty Rao  wrote:

> Hi
> I'm new to Spark. I'm doing some research to see if spark streaming can
> solve my problem. I don't want to keep per-key state,b/c my data set is
> very huge and keep a little longer time, it not viable to keep all per key
> state in memory.Instead, i want to have a bloom filter based state. Does it
> possible to achieve this in Spark streaming.
>
> --
> Anty Rao
>


Not per-key state in spark streaming

2016-12-07 Thread Anty Rao
Hi
I'm new to Spark. I'm doing some research to see if Spark Streaming can
solve my problem. I don't want to keep per-key state, because my data set is
very huge and has to be kept a little longer, so it is not viable to keep all
per-key state in memory. Instead, I want to have a bloom-filter-based state. Is
it possible to achieve this in Spark Streaming?

-- 
Anty Rao


Re: get corrupted rows using columnNameOfCorruptRecord

2016-12-07 Thread Hyukjin Kwon
Let me please just extend the suggestion a bit more verbosely.

I think you could try something like this maybe.

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// append the corrupt-record column to the user-supplied schema so it can be selected
val jsonDF = spark.read
  .option("columnNameOfCorruptRecord", "xxx")
  .option("mode","PERMISSIVE")
  .schema(StructType(schema.fields :+ StructField("xxx", StringType, true)))
  .json(corruptRecords)
val malformed = jsonDF.filter("xxx is not null").select("xxx")
malformed.show()

This prints something like the ones below:

++
| xxx|
++
|   {|
|{"a":1, b:2}|
|{"a":{, b:3}|
|   ]|
++


If the schema is not specified, then the inferred schema includes the
malformed-record column automatically, but when the schema is specified
explicitly, I believe this column has to be added manually.




2016-12-07 18:06 GMT+09:00 Yehuda Finkelstein :

> Hi
>
>
>
> I tried it already but it say that this column doesn’t exists.
>
>
>
> scala> var df = spark.sqlContext.read.
>
>  | option("columnNameOfCorruptRecord","xxx").
>
>  | option("mode","PERMISSIVE").
>
>  | schema(df_schema.schema).json(f)
>
> df: org.apache.spark.sql.DataFrame = [auctionid: string, timestamp: string
> ... 37 more fields]
>
>
>
> scala> df.select
>
> select   selectExpr
>
>
>
> scala> df.select("xxx").show
>
> org.apache.spark.sql.AnalysisException: cannot resolve '`xxx`' given
> input columns: […];;
>
>
>
>   at org.apache.spark.sql.catalyst.analysis.package$
> AnalysisErrorAt.failAnalysis(package.scala:42)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(
> CheckAnalysis.scala:77)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(
> CheckAnalysis.scala:74)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformUp$1.apply(TreeNode.scala:308)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$
> transformUp$1.apply(TreeNode.scala:308)
>
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.
> withOrigin(TreeNode.scala:69)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(
> TreeNode.scala:307)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.
> transformExpressionUp$1(QueryPlan.scala:269)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$
> spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$
> 2(QueryPlan.scala:279)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$
> apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(
> QueryPlan.scala:283)
>
>   at scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:234)
>
>   at scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:234)
>
>   at scala.collection.mutable.ResizableArray$class.foreach(
> ResizableArray.scala:59)
>
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$
> spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$
> 2(QueryPlan.scala:283)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$8.
> apply(QueryPlan.scala:288)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode.
> mapProductIterator(TreeNode.scala:186)
>
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(
> QueryPlan.scala:288)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$
> anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(
> TreeNode.scala:126)
>
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.
> checkAnalysis(CheckAnalysis.scala:67)
>
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.
> checkAnalysis(Analyzer.scala:58)
>
>   at org.apache.spark.sql.execution.QueryExecution.
> assertAnalyzed(QueryExecution.scala:49)
>
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$
> withPlan(Dataset.scala:2603)
>
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
>
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:987)
>
>   ... 48 elided
>
>
>
> scala>
>
>
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Tuesday, December 06, 2016 10:26 PM
> *To:* Yehuda Finkelstein
> *Cc:* user
> *Subject:* Re: get corrupted rows using columnNameOfCorruptRecord
>
>
>
> .where("xxx IS NOT NULL") will give you the rows that couldn't be parsed.
>
>
>
> On Tue, Dec 6, 2016 at 6:31 AM, Yehuda Finkelstein <
> yeh...@veracity-group.com> wrote:
>
> Hi all
>
>
>
> I’m trying to parse json using existing schema and got rows with NULL’s
>
> //get schema
>
> 

Debugging persistence of custom estimator

2016-12-07 Thread geoHeil
Hi,

I am writing my first custom Spark pipeline components with persistence and
am having trouble debugging them.
https://github.com/geoHeil/sparkCustomEstimatorPersistenceProblem holds a
minimal example where
`sbt run` and `sbt test` result in "different" errors.

When I tried to debug it in IntelliJ, I only got a very strange JNI error
when running/debugging via its internal launcher.

Any input would be great.
Thanks a lot.
Kind regards,
Georg



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Debugging-persistence-of-custom-estimator-tp28178.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 2.0.2 with Spark JobServer

2016-12-07 Thread Jose Carlos Guevara Turruelles
Hi,

 

I'm working with the latest version of Spark JobServer together with Spark
2.0.2. I'm able to do almost everything I need, but there is one annoying issue.


I have placed a hive-site.xml to specify a connection to my MySQL db so I
can have the metastore_db on MySQL; that works fine while I'm submitting
my jobs through spark-submit.

 

But once I start using the Spark JobServer it doesn't work: the only parameter
it takes from the hive-site.xml is hive.metastore.warehouse.dir. I have even
tried to use the default, i.e. without the hive-site.xml so Spark should use
Derby to store the metastore_db, but that doesn't work either.

 

I don't have Hive installed, I just use the Hive support which comes with
Spark.

 

this is my hive-site.xml

 







<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>user</value>
    <description>username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>passwork</value>
    <description>password to use against metastore database</description>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/data/spark-warehouse</value>
    <description>Warehouse Location</description>
  </property>

</configuration>





 



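
As a possible (untested here) workaround sketch in Scala: SparkConf entries
prefixed with "spark.hadoop." are copied into the Hadoop configuration Spark
uses, so the metastore settings from the hive-site.xml above could also be
supplied on the conf the job server builds its context from. Whether the job
server actually honours these is an assumption; the values simply mirror the
XML above:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.hadoop.javax.jdo.option.ConnectionURL",
    "jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true")
  .set("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
  .set("spark.hadoop.javax.jdo.option.ConnectionUserName", "user")
  .set("spark.hadoop.javax.jdo.option.ConnectionPassword", "passwork")
  .set("spark.sql.warehouse.dir", "/data/spark-warehouse")   // Spark 2.x setting that replaces hive.metastore.warehouse.dir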


Identifying DataFrames in executing tasks

2016-12-07 Thread Aniket More
Hi,
I am doing a POC in which I have implemented a custom SparkListener.
I have overridden methods such as onTaskEnd(taskEnd: SparkListenerTaskEnd),
onStageCompleted(stageCompleted: SparkListenerStageCompleted), etc.,
from which I get information such as taskId, recordsWritten, stageId,
recordsRead, etc. (a minimal sketch of such a listener is shown below).
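
For context, a minimal sketch of that kind of listener (method and field names
are from the Spark 2.x listener API; the logging and the registration shown are
assumptions):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

class PocListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {   // metrics may be absent for failed tasks
      println(s"task ${taskEnd.taskInfo.taskId} (stage ${taskEnd.stageId}): " +
        s"read ${m.inputMetrics.recordsRead}, wrote ${m.outputMetrics.recordsWritten}")
    }
  }
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    println(s"stage ${stageCompleted.stageInfo.stageId} completed")
}

// registered on the driver, e.g. sc.addSparkListener(new PocListener)
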
But I am not able to identify the DataFrame being executed in a task.
For example, I need to identify the DataFrame for which an input file is read,
or the task in which DataFrames are joined.

Can someone suggest a solution for the above use case, so that I can get
DataFrame information for the executing tasks?
Thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Identifying-DataFrames-in-executing-tasks-tp28176.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Livy with Spark

2016-12-07 Thread Saisai Shao
Hi Mich,

1. Each user can create a Livy session (batch or interactive). One session is
backed by one Spark application, and the resource quota is the same as for a
normal Spark application (configured by spark.executor.cores/memory, etc.);
this is passed on to YARN if running on YARN. This is basically no different
from a normal Spark application.
2. Livy is just a REST service; inside Livy you can create multiple sessions,
each session is mapped to one Spark application, and resources are managed by
the cluster manager (like YARN).
3. One user can create a session, and that session will be run as that user;
it is similar to HiveServer's user impersonation or proxy user. (A sketch of a
session-creation request is shown below.)
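
As an illustration of points 1-3, a small Scala sketch of creating a Livy
session over its REST API (the host is a placeholder; 8998 is Livy's default
port; the JSON fields shown - kind, proxyUser, executorMemory, executorCores -
map onto the usual per-application resource settings mentioned above):

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

val conn = new URL("http://livy-host:8998/sessions").openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
val body = """{"kind": "spark", "proxyUser": "alice", "executorMemory": "2g", "executorCores": 2}"""
conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
println(s"Livy responded with HTTP ${conn.getResponseCode}")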



On Wed, Dec 7, 2016 at 5:51 AM, Mich Talebzadeh 
wrote:

> Thanks Richard.
>
> I saw your question in the above blog:
>
> How does Livy proxy the user? Per task? Do you know how quotas are
> assigned to users, like how do you stop one Livy user from using all of the
> resources available to the Executors?
>
> My points are:
>
>
>1. Still don't understand how quotas are assigned to users. Is that
>done by YARN?
>2. What will happen if more than one Livy is running on the same
>cluster all controlled by the same YARN. how resouces are allocated
>
> cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 5 December 2016 at 14:50, Richard Startin 
> wrote:
>
>> There is a great write up on Livy at
>> http://henning.kropponline.de/2016/11/06/
>>
>> On 5 Dec 2016, at 14:34, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> Has there been any experience using Livy with Spark to share multiple
>> Spark contexts?
>>
>> thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>


RE: get corrupted rows using columnNameOfCorruptRecord

2016-12-07 Thread Yehuda Finkelstein
Hi



I tried it already, but it says that this column doesn't exist.



scala> var df = spark.sqlContext.read.

 | option("columnNameOfCorruptRecord","xxx").

 | option("mode","PERMISSIVE").

 | schema(df_schema.schema).json(f)

df: org.apache.spark.sql.DataFrame = [auctionid: string, timestamp: string
... 37 more fields]



scala> df.select

select   selectExpr



scala> df.select("xxx").show

org.apache.spark.sql.AnalysisException: cannot resolve '`xxx`' given input
columns: […];;



  at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

  at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)

  at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)

  at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)

  at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)

  at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)

  at
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)

  at
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)

  at org.apache.spark.sql.catalyst.plans.QueryPlan.org
$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)

  at
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)

  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

  at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)

  at scala.collection.AbstractTraversable.map(Traversable.scala:104)

  at org.apache.spark.sql.catalyst.plans.QueryPlan.org
$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:283)

  at
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$8.apply(QueryPlan.scala:288)

  at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)

  at
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:288)

  at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)

  at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)

  at
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)

  at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)

  at
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)

  at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)

  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)

  at org.apache.spark.sql.Dataset.org
$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)

  at org.apache.spark.sql.Dataset.select(Dataset.scala:969)

  at org.apache.spark.sql.Dataset.select(Dataset.scala:987)

  ... 48 elided



scala>





*From:* Michael Armbrust [mailto:mich...@databricks.com]
*Sent:* Tuesday, December 06, 2016 10:26 PM
*To:* Yehuda Finkelstein
*Cc:* user
*Subject:* Re: get corrupted rows using columnNameOfCorruptRecord



.where("xxx IS NOT NULL") will give you the rows that couldn't be parsed.



On Tue, Dec 6, 2016 at 6:31 AM, Yehuda Finkelstein <
yeh...@veracity-group.com> wrote:

Hi all



I’m trying to parse json using existing schema and got rows with NULL’s

//get schema

val df_schema = spark.sqlContext.sql("select c1,c2,…cn t1  limit 1")

//read json file

val f = sc.textFile("/tmp/x")

//load json into data frame using schema

var df =
spark.sqlContext.read.option("columnNameOfCorruptRecord","xxx").option("mode","PERMISSIVE").schema(df_schema.schema).json(f)



in the documentation it says that you can query the corrupted rows via this
column, columnNameOfCorruptRecord:

“columnNameOfCorruptRecord (default is the value specified in
spark.sql.columnNameOfCorruptRecord): allows renaming the new field having
malformed string created by PERMISSIVE mode. This overrides
spark.sql.columnNameOfCorruptRecord.”



The question is how to fetch those corrupted rows ?





Thanks

Yehuda


Not per-key state

2016-12-07 Thread Anty Rao
Hi all
I'm new to Spark. I'm doing some research to see if Spark Streaming can
solve my problem. I don't want to keep per-key state, because my data set is
very huge, so it is not viable to keep all per-key state in memory. Instead, I
want to have a bloom-filter-based state. Is it possible to achieve this
in Spark Streaming?

-- 
Thanks

Anty Rao