Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
Thanks Marcin, That seems to be the case. It explains why there is no documentation on this part too! To be specific, where exactly should spark.authenticate be set to true? Many thanks, Gerry > On 8 Dec 2016, at 08:46, Marcin Pastecki wrote: > > My understanding

How to clean the cache when i do performance test in spark

2016-12-07 Thread Zhang, Liyun
Hi all: When I test my spark application, I found that the second round (application_1481153226569_0002) is faster than the first round (application_1481153226569_0001). Actually the configuration is the same. I guess the second round is improved a lot by caching. So how can I clean the cache?
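
A minimal sketch of clearing Spark's own cache between rounds (assuming a Spark 2.x SparkSession named spark and an explicitly cached DataFrame df); note that OS page cache and JVM warm-up can also make a second run faster and are not affected by this:

    // Drop everything Spark has cached (tables, DataFrames) before the next round
    spark.catalog.clearCache()

    // Or unpersist one specific cached DataFrame/RDD and wait until its blocks are freed
    df.unpersist(blocking = true)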

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Marcin Pastecki
My understanding is that the token generation is handled by Spark itself as long as you were authenticated in Kerberos when submitting the job and spark.authenticate is set to true. --keytab and --principal options should be used for "long" running job, when you may need to do ticket renewal.
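
For reference, a hedged sketch of the two submission styles described above (principal, keytab path and the remaining application arguments are placeholders):

    # Short job: authenticate first, then submit
    kinit user@EXAMPLE.COM
    spark-submit --master yarn --deploy-mode cluster ...

    # Long-running job: let Spark renew credentials from a keytab
    spark-submit --master yarn --deploy-mode cluster \
      --principal user@EXAMPLE.COM \
      --keytab /path/to/user.keytab \
      ...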

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
I just read an interesting comment on cloudera: What does it mean by “when the job is submitted,and you have a kinit, you will have TOKEN to access HDFS, you would need to pass that on, or the KERBEROS ticket” ? Reference

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
Thanks Marcelo. I’ve completely removed it. Ok - even if I read/write from HDFS? Trying the SparkPi example now G > On 7 Dec 2016, at 22:10, Marcelo Vanzin wrote: > > Have you removed all the code dealing with Kerberos that you posted? > You should not be setting

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-07 Thread Sonal Goyal
You can try updating metrics.properties for the sink of your choice. In our case, we add the following for getting application metrics in JSON format using http *.sink.reifier.class= org.apache.spark.metrics.sink.MetricsServlet Here, we have defined the sink with name reifier and its class is
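
A minimal conf/metrics.properties sketch along the same lines (the sink name "reifier" is arbitrary; the CsvSink variant is shown as an alternative from Spark's standard metrics configuration, with an assumed output directory):

    # JSON over the driver UI, as described above
    *.sink.reifier.class=org.apache.spark.metrics.sink.MetricsServlet

    # Alternative: periodically dump metrics as CSV files
    *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
    *.sink.csv.period=10
    *.sink.csv.unit=seconds
    *.sink.csv.directory=/tmp/spark-metrics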

Re: WARN util.NativeCodeLoader

2016-12-07 Thread Sean Owen
You can ignore it. You can also install the native libs in question, but it's just a minor accelerator. On Thu, Dec 8, 2016 at 2:36 PM baipeng wrote: > Hi ALL > > I’m new to Spark. When I execute spark-shell, the first line is as follows > WARN util.NativeCodeLoader: Unable to

WARN util.NativeCodeLoader

2016-12-07 Thread baipeng
Hi ALL I’m new to Spark. When I execute spark-shell, the first line is as follows: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. Can someone tell me how to solve the problem?

unscribe

2016-12-07 Thread smith_666

Unsubscribe

2016-12-07 Thread Roger Holenweger
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Unsubscribe

2016-12-07 Thread Prashant Singh Thakur
Best Regards, Prashant Thakur Work : 6046 Mobile: +91-9740266522 NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received

Unsubscribe

2016-12-07 Thread Ajit Jaokar
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Unsubscribe

2016-12-07 Thread Kranthi Gmail
-- Kranthi PS: Sent from mobile, pls excuse the brevity and typos. > On Dec 7, 2016, at 8:05 PM, Siddhartha Khaitan > wrote: > >

Unsubscribe

2016-12-07 Thread Siddhartha Khaitan

unsubscribe

2016-12-07 Thread Ajith Jose

Re: Not per-key state in spark streaming

2016-12-07 Thread Anty Rao
On Wed, Dec 7, 2016 at 7:42 PM, Anty Rao wrote: > Hi > I'm new to Spark. I'm doing some research to see if spark streaming can > solve my problem. I don't want to keep per-key state, because my data set is > very large and has to be kept a little longer, so it is not viable to keep all per-key

Re: Not per-key state in spark streaming

2016-12-07 Thread Anty Rao
@Daniel Thanks for your reply. I will try it. On Wed, Dec 7, 2016 at 8:47 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi Anty, > What you could do is keep in the state only the existence of a key and > when necessary pull it from a secondary state store like HDFS or HBASE. > >

Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Iman Mohtashemi
yes exactly. I run mine fine in Eclipse but when I run it from a corresponding jar I get the same error! On Wed, Dec 7, 2016 at 5:04 PM Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > I believe, it's not about the location (i.e., local machine or HDFS) but > it's all about the

Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Md. Rezaul Karim
I believe, it's not about the location (i.e., local machine or HDFS) but it's all about the format of the input file. For example, I am getting the following error while trying to read an input file in libsvm format: *Exception in thread "main" java.lang.ClassNotFoundException: Failed to find

StreamingContext.textFileStream(...)

2016-12-07 Thread muthu
Hello there, I am trying to find a way to get the file name of the current file being processed from the monitored HDFS directory... Meaning, let's say... val lines = ssc.textFileStream("my_hdfs_location") lines.map { (row: String) => ... } // No access to file-name here Also, let's say I

Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Iman Mohtashemi
No, but I tried that too and it still didn't work. Where are the files being read from? From the local machine or HDFS? Do I need to get the files to HDFS first? In Eclipse I just point to the location of the directory? On Wed, Dec 7, 2016 at 3:34 PM Md. Rezaul Karim <

Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Md. Rezaul Karim
Hi, You should prepare your jar file (from your Spark application written in Java) with all the necessary dependencies. You can create a Maven project on Eclipse by specifying the dependencies in a Maven friendly pom.xml file. For building the jar with the dependencies and *main class (since you

Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Gmail
Don't you need to provide your class name "JavaWordCount"? Thanks, Vasu. > On Dec 7, 2016, at 3:18 PM, im281 wrote: > > Hello, > I have a simple word count example in Java and I can run this in Eclipse > (code at the bottom) > > I then create a jar file from it and

Running spark from Eclipse and then Jar

2016-12-07 Thread im281
Hello, I have a simple word count example in Java and I can run this in Eclipse (code at the bottom). I then create a jar file from it and try to run it from the cmd: java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt But I get this error. I think the main error is: *Exception in
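
A hedged sketch of how such a jar is usually invoked once the main class and dependencies are in place (JavaWordCount is an assumed class name; plain java only works if the jar bundles Spark itself, otherwise hand it to spark-submit):

    java -cp C:\Users\Owner\Desktop\wordcount.jar JavaWordCount Data/testfile.txt

    spark-submit --class JavaWordCount --master local[*] C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt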

Pruning decision tree to create an optimal tree

2016-12-07 Thread Md. Rezaul Karim
Hi there, Say I have a deep tree that needs to be pruned to create an optimal tree. For example, in R this can be done using the *rpart* and *prune* functions. Is it possible to prune an MLlib-based decision tree while performing classification or regression? Regards,

Accessing classpath resources in Spark Shell

2016-12-07 Thread Michal Šenkýř
Hello everyone, I recently encountered a situation where I needed to add a custom classpath resource to my driver and access it from an included library (specifically a configuration file for a custom Dataframe Reader). I need to use it from both inside an application which I submit to the

Driver/Executor Memory values during Unit Testing

2016-12-07 Thread Aleksander Eskilson
Hi there, I've been trying to increase the spark.driver.memory and spark.executor.memory during some unit tests. Most of the information I can find about increasing memory for Spark is based on either flags to spark-submit, or settings in the spark-defaults.conf file. Running unit tests with
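
A sketch of setting these on the test session directly, assuming Spark 2.x tests built around SparkSession; the caveat in the comments is the usual catch in local mode:

    import org.apache.spark.sql.SparkSession

    // In local mode the driver (and the executors) run inside the test JVM itself,
    // so spark.driver.memory set here cannot grow an already-running heap; the heap
    // usually has to be raised via the build tool (sbt javaOptions / surefire argLine).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("unit-test")
      .config("spark.driver.memory", "2g")
      .config("spark.executor.memory", "2g")
      .getOrCreate()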

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Marcelo Vanzin
Have you removed all the code dealing with Kerberos that you posted? You should not be setting those principal / keytab configs. Literally all you have to do is login with kinit then run spark-submit. Try with the SparkPi example for instance, instead of your own code. If that doesn't work, you
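
A hedged example of that minimal flow (realm, Scala/Spark versions and paths are placeholders):

    kinit someuser@EXAMPLE.COM
    spark-submit --class org.apache.spark.examples.SparkPi \
      --master yarn --deploy-mode cluster \
      $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.2.jar 100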

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
Thanks. I’ve checked the TGT, principal and key tab. Where to next?! > On 7 Dec 2016, at 22:03, Marcelo Vanzin wrote: > > On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey > wrote: >> Can anyone point me to a tutorial or a run through of how to

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Marcelo Vanzin
On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey wrote: > Can anyone point me to a tutorial or a run through of how to use Spark with > Kerberos? This is proving to be quite confusing. Most search results on the > topic point to what needs inputted at the point of `sparks

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
If your operations are idempotent, you should be able to just run a totally separate job that looks for failed batches and does a kafkaRDD to reprocess that batch. C* probably isn't the first choice for what is essentially a queue, but if the frequency of batches is relatively low it probably
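
A rough sketch of such a reprocessing job with the Kafka 0.8 direct API, assuming the failed batch's offset ranges were recorded somewhere (topic, broker and offsets below are made up):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // sc: an existing SparkContext
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val offsetRanges = Array(OffsetRange("events", 0, 105000L, 110000L)) // topic, partition, from, until

    val failedBatch = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    failedBatch.foreachPartition { records =>
      // idempotent write to the downstream store (e.g. Cassandra) goes here
    }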

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread map reduced
> > Personally I think forcing the stream to fail (e.g. check offsets in > downstream store and throw exception if they aren't as expected) is > the safest thing to do. I would think so too, but just for say 2-3 (sometimes just 1) failed batches in a whole day, I am trying to not kill the whole

Re: Spark streaming completed batches statistics

2016-12-07 Thread map reduced
Just keep in mind that rest-api needs to be called from driver ui endpoint and not from Spark/Master UI. On Wed, Dec 7, 2016 at 12:03 PM, Richard Startin wrote: > Ok it looks like I could reconstruct the logic in the Spark UI from the > /jobs resource. Thanks. > > >
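
For example (4040 is the default driver UI port; the application id comes from the first call, and the streaming-specific resource is available on newer releases):

    curl http://driver-host:4040/api/v1/applications
    curl http://driver-host:4040/api/v1/applications/<app-id>/jobs
    curl http://driver-host:4040/api/v1/applications/<app-id>/streaming/batches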

Re: Reprocessing failed jobs in Streaming job

2016-12-07 Thread Cody Koeninger
Personally I think forcing the stream to fail (e.g. check offsets in downstream store and throw exception if they aren't as expected) is the safest thing to do. If you proceed after a failure, you need a place to reliably record the batches that failed for later processing. On Wed, Dec 7, 2016

Re: Kerberos and YARN - functions in spark-shell and spark submit local but not cluster mode

2016-12-07 Thread Gerard Casey
Thanks Marcelo, Turns out I had missed setup steps in the actual file itself. Thanks to Richard for the help here. He pointed me to some java implementations. I’m using the import org.apache.hadoop.security API. I now have: /* graphx_sp.scala */ import scala.util.Try import scala.io.Source
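
For context, the explicit login that API offers looks roughly like this (principal and keytab path are placeholders); note that other replies in this thread suggest it should be unnecessary when the job is submitted from a session that has already run kinit:

    import org.apache.hadoop.security.UserGroupInformation

    // Explicit Kerberos login from inside the application (placeholder credentials)
    UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")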

Re: Spark streaming completed batches statistics

2016-12-07 Thread Richard Startin
Ok it looks like I could reconstruct the logic in the Spark UI from the /jobs resource. Thanks. https://richardstartin.com/ From: map reduced Sent: 07 December 2016 19:49 To: Richard Startin Cc: user@spark.apache.org Subject: Re: Spark

Re: Spark streaming completed batches statistics

2016-12-07 Thread map reduced
Have you checked http://spark.apache.org/docs/latest/monitoring.html#rest-api ? KP On Wed, Dec 7, 2016 at 11:43 AM, Richard Startin wrote: > Is there any way to get this information as CSV/JSON? > > > https://docs.databricks.com/_images/CompletedBatches.png > > >

Reprocessing failed jobs in Streaming job

2016-12-07 Thread map reduced
Hi, I am trying to solve this problem - in my streaming flow, every day a few jobs fail for a few batches due to some (say Kafka cluster maintenance etc, mostly unavoidable) reasons and then resume back to success. I want to reprocess those failed jobs programmatically (assume I have a way of getting

Re: Spark streaming completed batches statistics

2016-12-07 Thread Richard Startin
Is there any way to get this information as CSV/JSON? https://docs.databricks.com/_images/CompletedBatches.png [https://docs.databricks.com/_images/CompletedBatches.png] https://richardstartin.com/ From: Richard Startin Sent: 05

create new spark context from ipython or jupyter

2016-12-07 Thread pseudo oduesp
Hi, how can we create a new SparkContext from an IPython or Jupyter session? I mean, if I use the current SparkContext and run sc.stop(), how can I launch a new one from IPython without restarting the IPython session by refreshing the browser? Sometimes I write some functions and then figure out I forgot something inside
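
A minimal PySpark sketch of recreating the context without restarting the kernel (assuming the sc the shell created is still in scope):

    from pyspark import SparkConf, SparkContext

    sc.stop()  # stop the context the shell created
    conf = SparkConf().setAppName("fresh-context")
    sc = SparkContext(conf=conf)  # new context in the same kernel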

filter RDD by variable

2016-12-07 Thread Soheila S.
Hi, I am new to Spark and have a question about the first steps of learning Spark. How can I filter an RDD using a String variable (for example words(i)), instead of a fixed one like "Error"? Thanks a lot in advance. Soheila
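
A minimal sketch, assuming an RDD[String] named lines and an Array[String] named words:

    val keyword = words(i)  // any String variable
    val matching = lines.filter(line => line.contains(keyword))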

Re: Not per-key state in spark streaming

2016-12-07 Thread Daniel Haviv
Hi Anty, What you could do is keep in the state only the existence of a key and when necessary pull it from a secondary state store like HDFS or HBASE. Daniel On Wed, Dec 7, 2016 at 1:42 PM, Anty Rao wrote: > Hi > I'm new to Spark. I'm doing some research to see if spark
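
A rough sketch of that idea using mapWithState, keeping only a per-key Boolean and leaving the full record in the external store (the input stream name and types are assumptions):

    import org.apache.spark.streaming.{State, StateSpec}

    // events: DStream[(String, String)] of (key, payload) pairs; mapWithState
    // also requires ssc.checkpoint(...) to be set.
    def seenBefore(key: String, payload: Option[String], state: State[Boolean]): (String, Boolean) = {
      val seen = state.exists()
      if (!seen) state.update(true)  // remember only that this key was seen
      (key, seen)                    // fetch the full record from HDFS/HBase only when needed
    }

    val flagged = events.mapWithState(StateSpec.function(seenBefore _))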

Not per-key state in spark streaming

2016-12-07 Thread Anty Rao
Hi I'm new to Spark. I'm doing some research to see if spark streaming can solve my problem. I don't want to keep per-key state, because my data set is very large and has to be kept a little longer, so it is not viable to keep all per-key state in memory. Instead, I want to have a bloom-filter-based state. Does it

Re: get corrupted rows using columnNameOfCorruptRecord

2016-12-07 Thread Hyukjin Kwon
Let me please just extend the suggestion a bit more verbosely. I think you could try something like this maybe. val jsonDF = spark.read .option("columnNameOfCorruptRecord", "xxx") .option("mode","PERMISSIVE") .schema(StructType(schema.fields :+ StructField("xxx", StringType, true)))
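
Put together, a sketch of the whole read plus selecting only the corrupted rows (spark, schema and the path f are assumed from the thread):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val jsonDF = spark.read
      .option("columnNameOfCorruptRecord", "xxx")
      .option("mode", "PERMISSIVE")
      .schema(StructType(schema.fields :+ StructField("xxx", StringType, true)))
      .json(f)

    val corrupted = jsonDF.filter(col("xxx").isNotNull)           // rows Spark could not parse
    val clean     = jsonDF.filter(col("xxx").isNull).drop("xxx")  // everything else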

Debugging persistence of custom estimator

2016-12-07 Thread geoHeil
Hi, I am writing my first Spark pipeline components with persistence and am having trouble debugging them. https://github.com/geoHeil/sparkCustomEstimatorPersistenceProblem holds a minimal example where `sbt run` and `sbt test` result in "different" errors. When I tried to debug it in IntelliJ I

Spark 2.0.2 with Spark JobServer

2016-12-07 Thread Jose Carlos Guevara Turruelles
Hi, I'm working with the latest version of Spark JobServer together with Spark 2.0.2. I'm able to do almost everything I need, but there is one noisy thing. I have placed a hive-site.xml to specify a connection to my MySQL db so I can have the metastore_db on MySQL; that works fine while

Identifying DataFrames in executing tasks

2016-12-07 Thread Aniket More
Hi, I am doing a POC in which I have implemented a custom SparkListener. I have overridden methods such as onTaskEnd(taskEnd: SparkListenerTaskEnd), onStageCompleted(stageCompleted: SparkListenerStageCompleted), etc., from which I get information such as taskId, recordsWritten, stageId, recordsRead, etc.
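
For reference, a skeleton of the kind of listener being described (the printed metric names are illustrative):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    class PocListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {  // metrics can be missing for failed tasks
          println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
            s"recordsRead=${m.inputMetrics.recordsRead} recordsWritten=${m.outputMetrics.recordsWritten}")
        }
      }
    }

    sc.addSparkListener(new PocListener)  // sc: an existing SparkContext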

Re: Livy with Spark

2016-12-07 Thread Saisai Shao
Hi Mich, 1. Each user could create a Livy session (batch or interactive), one session is backed by one Spark application, and the resource quota is the same as normal spark application (configured by spark.executor.cores/memory,. etc), and this will be passed to yarn if running on Yarn. This is

RE: get corrupted rows using columnNameOfCorruptRecord

2016-12-07 Thread Yehuda Finkelstein
Hi, I tried it already but it says that this column doesn’t exist. scala> var df = spark.sqlContext.read. | option("columnNameOfCorruptRecord","xxx"). | option("mode","PERMISSIVE"). | schema(df_schema.schema).json(f) df: org.apache.spark.sql.DataFrame = [auctionid: string,

Not per-key state

2016-12-07 Thread Anty Rao
Hi all, I'm new to Spark. I'm doing some research to see if spark streaming can solve my problem. I don't want to keep per-key state, because my data set is very large, so it is not viable to keep all per-key state in memory. Instead, I want to have a bloom-filter-based state. Is it possible to achieve this