Thanks Marcin,
That seems to be the case. It explains why there is no documentation on this
part too!
To be specific, where exactly should spark.authenticate be set to true?
Many thanks,
Gerry
> On 8 Dec 2016, at 08:46, Marcin Pastecki wrote:
>
> My understanding
Hi all:
When I test my Spark application, I found that the second
round (application_1481153226569_0002) is much faster than the first
round (application_1481153226569_0001), even though the configuration is the
same. I guess the second round is sped up a lot by caching. So how can I clear the cache?
My understanding is that the token generation is handled by Spark itself as
long as you were authenticated in Kerberos when submitting the job and
spark.authenticate is set to true.
--keytab and --principal options should be used for "long"-running jobs,
when you may need to do ticket renewal.
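To make that concrete: spark.authenticate is an ordinary Spark property, so it can go in conf/spark-defaults.conf or be passed at submit time. A rough sketch of a long-running submission on YARN (the principal, keytab path, class and jar names are placeholders):

spark-submit \
  --master yarn --deploy-mode cluster \
  --principal alice@EXAMPLE.COM \
  --keytab /etc/security/keytabs/alice.keytab \
  --conf spark.authenticate=true \
  --class com.example.MyApp my-app.jar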
I just read an interesting comment on cloudera:
What does it mean by “when the job is submitted,and you have a kinit, you will
have TOKEN to access HDFS, you would need to pass that on, or the KERBEROS
ticket” ?
Reference
Thanks Marcelo.
I’ve completely removed it. Ok - even if I read/write from HDFS?
Trying the SparkPi example now
G
> On 7 Dec 2016, at 22:10, Marcelo Vanzin wrote:
>
> Have you removed all the code dealing with Kerberos that you posted?
> You should not be setting
You can try updating metrics.properties for the sink of your choice. In our
case, we add the following for getting application metrics in JSON format
over HTTP:
*.sink.reifier.class=org.apache.spark.metrics.sink.MetricsServlet
Here, we have defined a sink named reifier whose class is
org.apache.spark.metrics.sink.MetricsServlet.
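A minimal sketch of conf/metrics.properties entries along those lines (the reifier name is just the one chosen above; out of the box Spark also serves driver metrics as JSON at /metrics/json on the driver UI):

# servlet-style sink for all instances, as described above
*.sink.reifier.class=org.apache.spark.metrics.sink.MetricsServlet
# optionally a console sink to see the same metrics on stdout
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds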
You can ignore it. You can also install the native libs in question but
it's just a minor accelerator.
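If the warning itself is too noisy, one option (assuming the default log4j setup) is to silence that logger in conf/log4j.properties:

log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR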
On Thu, Dec 8, 2016 at 2:36 PM baipeng wrote:
> Hi ALL
>
> I’m new to Spark.When I execute spark-shell, the first line is as follows
> WARN util.NativeCodeLoader: Unable to
Hi ALL
I’m new to Spark. When I execute spark-shell, the first line is as follows:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable.
Can someone tell me how to solve the problem?
On Wed, Dec 7, 2016 at 7:42 PM, Anty Rao wrote:
> Hi
> I'm new to Spark. I'm doing some research to see if Spark Streaming can
> solve my problem. I don't want to keep per-key state, because my data set is
> very large and needs to be kept for a fairly long time, so it is not viable to keep all per-key
@Daniel
Thanks for your reply. I will try it.
On Wed, Dec 7, 2016 at 8:47 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi Anty,
> What you could do is keep in the state only the existence of a key and
> when necessary pull it from a secondary state store like HDFS or HBASE.
>
>
Yes, exactly. Mine runs fine in Eclipse, but when I run it from the
corresponding jar I get the same error!
On Wed, Dec 7, 2016 at 5:04 PM Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:
> I believe, it's not about the location (i.e., local machine or HDFS) but
> it's all about the
I believe it's not about the location (i.e., local machine or HDFS) but
all about the format of the input file. For example, I am getting the
following error while trying to read an input file in libsvm format:
*Exception in thread "main" java.lang.ClassNotFoundException: Failed to
find
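For context, the usual way to read a libsvm-formatted file in Spark 2.x is something like the following (the path is a placeholder, and this assumes the file really follows the libsvm layout):

val data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")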
Hello there,
I am trying to find a way to get the file-name of the current file being
processed from the monitored directory for HDFS...
Meaning,
Let's say...
val lines = ssc.textFileStream("my_hdfs_location")
lines.map { (row: String) => ... } //No access to file-name here
Also, let's say I
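Not the DStream API used above, but if the DataFrame API is an option, input_file_name() records the source file per row. A minimal sketch (assumes an existing SparkSession named spark):

import org.apache.spark.sql.functions.input_file_name

val withNames = spark.read
  .textFile("my_hdfs_location")
  .toDF("line")
  .withColumn("source_file", input_file_name())   // file each row came from
withNames.show(false)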
No, but I tried that too and it still didn't work. Where are the files being
read from, the local machine or HDFS? Do I need to get the files to
HDFS first? In Eclipse I just point to the location of the directory.
On Wed, Dec 7, 2016 at 3:34 PM Md. Rezaul Karim <
Hi,
You should prepare your jar file (from your Spark application written in
Java) with all the necessary dependencies. You can create a Maven project
in Eclipse by specifying the dependencies in a Maven-friendly pom.xml file.
For building the jar with the dependencies and *main class (since you
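Once the fat jar is built, it is normally launched through spark-submit rather than plain java -jar, since spark-submit sets up the Spark classpath for you. A hedged sketch, reusing the class name and paths mentioned elsewhere in this thread:

spark-submit --class JavaWordCount --master local[*] \
  C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt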
Don't you need to provide your class name "JavaWordCount"?
Thanks,
Vasu.
> On Dec 7, 2016, at 3:18 PM, im281 wrote:
>
> Hello,
> I have a simple word count example in Java and I can run this in Eclipse
> (code at the bottom)
>
> I then create a jar file from it and
Hello,
I have a simple word count example in Java and I can run this in Eclipse
(code at the bottom)
I then create a jar file from it and try to run it from the cmd
java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt
But I get this error?
I think the main error is:
*Exception in
Hi there,
Say I have a deep tree that needs to be pruned to create an optimal
tree. For example, in R this can be done using the *rpart* and *prune* functions.
Is it possible to prune an MLlib-based decision tree while performing
classification or regression?
Regards,
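As far as I know MLlib has no rpart-style post-pruning, but the spark.ml decision trees can be pre-pruned by constraining their growth. A rough sketch (column names are placeholders):

import org.apache.spark.ml.classification.DecisionTreeClassifier

// pre-pruning via growth constraints, in lieu of post-pruning
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxDepth(5)              // cap the tree depth
  .setMinInstancesPerNode(20)  // don't split nodes with too few rows
  .setMinInfoGain(0.01)        // require a minimum gain per split
// val model = dt.fit(trainingDF)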
Hello everyone,
I recently encountered a situation where I needed to add a custom
classpath resource to my driver and access it from an included library
(specifically a configuration file for a custom Dataframe Reader).
I need to use it from both inside an application which I submit to the
Hi there,
I've been trying to increase the spark.driver.memory and
spark.executor.memory during some unit tests. Most of the information I can
find about increasing memory for Spark is based on either flags to
spark-submit, or settings in the spark-defaults.conf file. Running unit
tests with
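For what it's worth, a rough sketch of how tests often build their session; note that in local mode the "driver" is the test JVM itself, so its heap is really governed by the -Xmx the test runner forks with (e.g. javaOptions in sbt), not by spark.driver.memory set here:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit-test")
  .config("spark.executor.memory", "2g")  // only meaningful with a non-local master
  .config("spark.ui.enabled", "false")    // keep tests lightweight
  .getOrCreate()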
Have you removed all the code dealing with Kerberos that you posted?
You should not be setting those principal / keytab configs.
Literally all you have to do is log in with kinit and then run spark-submit.
Try with the SparkPi example for instance, instead of your own code.
If that doesn't work, you
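Something like this, roughly (the examples jar name varies by Spark version and build):

kinit your_principal@EXAMPLE.COM
spark-submit --master yarn \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.2.jar 100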
Thanks.
I’ve checked the TGT, principal and keytab. Where to next?!
> On 7 Dec 2016, at 22:03, Marcelo Vanzin wrote:
>
> On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey
> wrote:
>> Can anyone point me to a tutorial or a run through of how to
On Wed, Dec 7, 2016 at 12:15 PM, Gerard Casey wrote:
> Can anyone point me to a tutorial or a run through of how to use Spark with
> Kerberos? This is proving to be quite confusing. Most search results on the
> topic point to what needs inputted at the point of `sparks
If your operations are idempotent, you should be able to just run a
totally separate job that looks for failed batches and does a kafkaRDD
to reprocess that batch. C* probably isn't the first choice for what
is essentially a queue, but if the frequency of batches is relatively
low it probably
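A rough sketch of such a reprocessing job with the Kafka 0.8 direct API (sc is the SparkContext; topic, offsets and broker list are placeholders, and the failed offset ranges would come from wherever you recorded them):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// offsets recorded when the batch failed (placeholder values)
val offsetRanges = Array(OffsetRange("events", 0, 12345L, 12400L))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)
rdd.foreachPartition { records =>
  // idempotent re-write of these records to the downstream store goes here
}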
>
> Personally I think forcing the stream to fail (e.g. check offsets in
> downstream store and throw exception if they aren't as expected) is
> the safest thing to do.
I would think so too, but just for say 2-3 (sometimes just 1) failed
batches in a whole day, I am trying to not kill the whole
Just keep in mind that the REST API needs to be called on the driver UI
endpoint, not on the Spark Master UI.
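For example, against the default driver UI port (host and application id are placeholders; the id is also available as sc.applicationId):

curl http://<driver-host>:4040/api/v1/applications/<app-id>/jobs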
On Wed, Dec 7, 2016 at 12:03 PM, Richard Startin wrote:
> Ok it looks like I could reconstruct the logic in the Spark UI from the
> /jobs resource. Thanks.
>
>
>
Personally I think forcing the stream to fail (e.g. check offsets in
downstream store and throw exception if they aren't as expected) is
the safest thing to do.
If you proceed after a failure, you need a place to reliably record
the batches that failed for later processing.
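A rough sketch of the check-and-fail idea (lastCommittedOffset is a hypothetical lookup against your own downstream store, and the stream is assumed to be a Kafka direct stream):

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.HasOffsetRanges

// stub: what the downstream store says it last committed for a topic-partition
def lastCommittedOffset(topic: String, partition: Int): Long = ???

def failFast(stream: DStream[(String, String)]): Unit =
  stream.foreachRDD { rdd =>
    rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { r =>
      val expected = lastCommittedOffset(r.topic, r.partition)
      // fail the batch so the job stops instead of silently skipping data
      if (r.fromOffset != expected)
        throw new IllegalStateException(
          s"Unexpected offset for ${r.topic}-${r.partition}: ${r.fromOffset} vs $expected")
    }
    // normal processing / idempotent writes go here
  }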
On Wed, Dec 7, 2016
Thanks Marcelo,
Turns out I had missed setup steps in the actual file itself. Thanks to Richard
for the help here. He pointed me to some Java implementations.
I’m using the org.apache.hadoop.security API.
I now have:
/* graphx_sp.scala */
import scala.util.Try
import scala.io.Source
Ok it looks like I could reconstruct the logic in the Spark UI from the /jobs
resource. Thanks.
https://richardstartin.com/
From: map reduced
Sent: 07 December 2016 19:49
To: Richard Startin
Cc: user@spark.apache.org
Subject: Re: Spark
Have you checked
http://spark.apache.org/docs/latest/monitoring.html#rest-api ?
KP
On Wed, Dec 7, 2016 at 11:43 AM, Richard Startin wrote:
> Is there any way to get this information as CSV/JSON?
>
>
> https://docs.databricks.com/_images/CompletedBatches.png
>
>
>
Hi,
I am trying to solve this problem: in my streaming flow, every day a few
jobs fail for a few batches due to some (say Kafka cluster maintenance etc.,
mostly unavoidable) reasons and then resume successfully.
I want to reprocess those failed jobs programmatically (assume I have a way
of getting
Is there any way to get this information as CSV/JSON?
https://docs.databricks.com/_images/CompletedBatches.png
https://richardstartin.com/
From: Richard Startin
Sent: 05
Hi,
How can we create a new SparkContext from an IPython or Jupyter session?
I mean, if I use the current SparkContext and I run sc.stop(),
how can I launch a new one from IPython without restarting the IPython session
by refreshing the browser?
Sometimes when I code some functions I figure out I forgot something inside
Hi
I am new to Spark and have a question about my first steps in learning Spark.
How can I filter an RDD using a String variable (for example words[i]),
instead of a fixed one like "Error"?
Thanks a lot in advance.
Soheila
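A minimal sketch of that in Scala (the input path and word list are placeholders; sc is the usual SparkContext):

val lines = sc.textFile("hdfs:///logs/app.log")   // placeholder input
val words = Array("Error", "Warning", "Info")
val keyword = words(0)                            // any String variable works like a literal
val filtered = lines.filter(line => line.contains(keyword))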
Hi Anty,
What you could do is keep in the state only the existence of a key and when
necessary pull it from a secondary state store like HDFS or HBASE.
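Roughly along these lines with updateStateByKey (this assumes a DStream[(String, String)] of events, checkpointing enabled, and a hypothetical fetchFromStore helper for the occasional full lookup):

import org.apache.spark.streaming.dstream.DStream

// keep only "has this key been seen" in Spark state; full records live in HDFS/HBase
def seenKeys(events: DStream[(String, String)]): DStream[(String, Boolean)] =
  events.updateStateByKey[Boolean] { (newValues: Seq[String], state: Option[Boolean]) =>
    Some(state.getOrElse(false) || newValues.nonEmpty)
  }

// hypothetical helper: fetch the full value for a key from the secondary store
def fetchFromStore(key: String): Option[String] = ???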
Daniel
On Wed, Dec 7, 2016 at 1:42 PM, Anty Rao wrote:
> Hi
> I'm new to Spark. I'm doing some research to see if spark
Hi
I'm new to Spark. I'm doing some research to see if Spark Streaming can
solve my problem. I don't want to keep per-key state, because my data set is
very large and needs to be kept for a fairly long time, so it is not viable to keep all per-key
state in memory. Instead, I want to have a bloom-filter-based state. Does it
Let me just extend the suggestion a bit more verbosely.
I think you could try something like this:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val jsonDF = spark.read
  .option("columnNameOfCorruptRecord", "xxx")
  .option("mode", "PERMISSIVE")
  .schema(StructType(schema.fields :+ StructField("xxx", StringType, true)))
  .json(f)
Hi,
I am writing my first custom Spark pipeline components with persistence and
have trouble debugging them.
https://github.com/geoHeil/sparkCustomEstimatorPersistenceProblem holds a
minimal example where
`sbt run` and `sbt test` result in "different" errors.
When I tried to debug it in IntelliJ I
Hi,
I'm working with the latest version of Spark JobServer together with Spark
2.0.2. I'm able to do almost everything I need, but there is one noisy thing.
I have placed a hive-site.xml to specify a connection to my MySQL db so I
can have the metastore_db on MySQL; that works fine while
Hi,
I am doing a POC in which I have implemented a custom Spark listener.
I have overridden methods such as onTaskEnd(taskEnd: SparkListenerTaskEnd),
onStageCompleted(stageCompleted: SparkListenerStageCompleted), etc.,
from which I get information such as taskId, recordsWritten, stageId,
recordsRead, etc.
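For reference, a minimal sketch of such a listener against the Spark 2.x listener API (println stands in for whatever reporting is actually done):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

class PocListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for failed tasks, hence the Option wrapper
    Option(taskEnd.taskMetrics).foreach { m =>
      println(s"task ${taskEnd.taskInfo.taskId}: " +
        s"recordsRead=${m.inputMetrics.recordsRead}, " +
        s"recordsWritten=${m.outputMetrics.recordsWritten}")
    }
  }
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    println(s"stage ${stageCompleted.stageInfo.stageId} completed")
}

// registered with, e.g.: sc.addSparkListener(new PocListener)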
Hi Mich,
1. Each user could create a Livy session (batch or interactive); one
session is backed by one Spark application, and the resource quota is the
same as for a normal Spark application (configured by
spark.executor.cores/memory, etc.), and this will be passed to YARN if
running on YARN. This is
Hi
I tried it already but it says that this column doesn’t exist.
scala> var df = spark.sqlContext.read.
| option("columnNameOfCorruptRecord","xxx").
| option("mode","PERMISSIVE").
| schema(df_schema.schema).json(f)
df: org.apache.spark.sql.DataFrame = [auctionid: string,
Hi ALL
I'm new to Spark. I'm doing some research to see if Spark Streaming can
solve my problem. I don't want to keep per-key state, because my data set is
very large and it is not viable to keep all per-key state in memory. Instead, I
want to have a bloom-filter-based state. Is it possible to achieve this