Spark does not support specifying both principal and proxy-user; you
need to use either proxy-user or principal.
It seems Spark currently only checks this among the spark-submit arguments but
ignores the configuration in spark-defaults.conf:
if (proxyUser != null && principal != null) {
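A sketch of the stricter check being suggested, which would also consult the
loaded defaults (this is not the actual SparkSubmit code; the args object and
the property name are assumptions):

    // also look at properties loaded from spark-defaults.conf, not just CLI args
    val principal = Option(args.principal)
      .orElse(sparkConf.getOption("spark.yarn.principal"))
    val proxyUser = Option(args.proxyUser)
    if (principal.isDefined && proxyUser.isDefined) {
      throw new IllegalArgumentException(
        "Only one of --principal and --proxy-user may be set")
    }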
In SparkSubmit, there is less work for yarn-client than for yarn-cluster.
It basically prepares some Spark configurations as system properties, for example,
information on additional resources required by the application that need to be
distributed to the cluster. These configurations will be
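For illustration, the kind of property involved looks like this (a sketch; the
file paths are hypothetical):

    import org.apache.spark.SparkConf
    // extra files that YARN should distribute alongside the application
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .set("spark.yarn.dist.files", "hdfs:///tmp/one.txt,hdfs:///tmp/two.txt")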
Hi,
I am using Spark 1.5.2 with Scala 2.10 and my Spark job keeps failing with
exit code 143, except for one job where I am using unionAll and a groupBy
operation on multiple columns.
Please advise me on options to optimize it.
The one option which I am using now is:
--conf
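For context, the unionAll/groupBy pattern in question looks roughly like this
(a sketch; the DataFrames and column names are hypothetical):

    // Spark 1.5 DataFrame API: union two frames, then aggregate on several columns
    val combined = df1.unionAll(df2)
    val counts = combined.groupBy("colA", "colB", "colC").count()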
Reminder: gen-idea has been removed in master. See:
commit a172e11cba6f917baf5bd6c4f83dc6689932de9a
Author: Luciano Resende
Date: Mon Apr 4 16:55:59 2016 -0700
[SPARK-14366] Remove sbt-idea plugin
## What changes were proposed in this pull request?
Hi
I am new to Spark and PySpark.
I am reading a small CSV file (~40k rows) into a DataFrame.
from pyspark.sql import functions as F
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('/tmp/sm.csv')
df = df.withColumn('verified',
Dear All,
I installed Spark 1.6.1 on Amazon EC2 using the spark-ec2 script. Everything
was OK, but it failed to start httpd at the end of the installation. I
followed the instructions exactly and repeated the process many times, but
no luck.
-
[timing] rstudio setup: 00h
>
> Julia can pick up the env var, and set the system properties or directly fill
> the configurations into a SparkConf, and then create a SparkContext
That's the point - just setting master to "yarn-client" doesn't work, even
in Java/Scala. E.g. the following code in *Scala*:
val conf = new
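(For reference, the kind of snippet being discussed looks roughly like this;
the app name is made up, and on its own it fails without HADOOP_CONF_DIR
pointing at the cluster configuration:)

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf().setAppName("YarnClientTest").setMaster("yarn-client")
    val sc = new SparkContext(conf) // not enough by itself to reach the cluster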
You need to use `backticks` to reference columns that have non-standard
characters.
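For example (a sketch reusing the column from the quoted message below):

    // backticks make SQL treat sum(OpenAccounts) as a literal column name
    df.filter("`sum(OpenAccounts)` > 5").show()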
On Wed, Apr 13, 2016 at 6:56 AM, wrote:
> Hi,
>
> I am debugging a program, and for some reason, a line calling the
> following is failing:
>
> df.filter("sum(OpenAccounts) >
I get this error.
Does anyone know what it means?
Py4JJavaError: An error occurred while calling
z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result:
It is described in "Hadoop: The Definitive Guide", chapter 3, FilePatterns:
https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch03.html#FilePatterns
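A quick way to see what a pattern expands to is Hadoop's FileSystem API, which
backs sc.textFile's glob handling (a minimal sketch; globStatus can return
null when nothing matches):

    import org.apache.hadoop.fs.{FileSystem, Path}
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.globStatus(new Path("/file1/*/*/*/*.txt"))
      .foreach(status => println(status.getPath))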
Yong
From: pradeep1...@gmail.com
Date: Wed, 13 Apr 2016 18:56:58 +
Subject: how does sc.textFile translate regex in
Actually, how many tables are involved here?
What version of Hive is being used? Sorry, I have no idea about the Cloudera
5.5.1 spec.
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
I am trying to understand how Spark's sc.textFile() works. I
specifically have a question about how it translates paths with regex-like
glob patterns in them.
For example:
files = sc.textFile("hdfs://:/file1/*/*/*/*.txt")
How does it find all the sub-directories and recurse to all the leaf
files? Is there
hi guys,
I get this error after 5 hours of processing. I do a lot of joins (14 left
joins) with small tables.
Everything looks OK in the Spark UI and console log, but when it saves the
last join I get this error:
Py4JJavaError: An error occurred while calling o115.parquet. _metadata is
not a Parquet
Hi Yong,
thanks for your response. As I said in my first email, I've tried both the
reference to the classpath resource (env/dev/log4j-executor.properties) and
the file:// protocol. Also, the driver logging is working fine and I'm
using the same kind of reference.
Below is the content of my
These are the components:

java -version
java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)

hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
Is the env/dev/log4j-executor.properties file within your jar file? Does the
path match what you specified as env/dev/log4j-executor.properties?
If you read the log4j document here:
https://logging.apache.org/log4j/1.2/manual.html
When you specify the
Thanks for your response, Ted. You're right, there was a typo. I changed it;
now I'm executing:
bin/spark-submit --master spark://localhost:7077 --conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
--conf
FYI, I wrote functionality to enable Lucene text analysis components to be used
to extract text features via a transformer in spark.ml pipelines.
Non-machine-learning uses are supported too.
See my blog describing the capabilities, which are included in the open-source
spark-solr project:
Mich
Are you building your own releases from the source?
Which version of Scala?
Again, the builds seem to be OK and working, but I don’t want to hit some
‘gotcha’ if I could avoid it.
> On Apr 13, 2016, at 7:15 AM, Mich Talebzadeh
> wrote:
>
> Hi,
>
> I am
w.r.t. the effective storage level log, here is the JIRA which introduced
it:
[SPARK-4671][Streaming]Do not replicate streaming block when WAL is enabled
On Wed, Apr 13, 2016 at 7:43 AM, Patrick McGloin
wrote:
> Hi all,
>
> If I am using a Custom Receiver with
I have an existing cluster that I stand up via Docker images and CloudFormation
Templates on AWS. We are moving to EMR and the AWS Data Pipeline process, and are
having problems with metrics and log4j. We’ve sent a JSON configuration for
spark-log4j and spark-metrics. The log4j file seems to be
Looks like the support does not exist, unless someone can correct me, and
there is an open JIRA:
https://issues.apache.org/jira/browse/SPARK-12909
bq. --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
I think the above may have a typo: you refer to log4j-driver.properties in
both arguments.
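A commonly suggested shape for this, shipping the properties file with --files
so executors can resolve a bare file name (a sketch, not something tested in
this thread):

    spark-submit --files env/dev/log4j-executor.properties \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
      ...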
FYI
On Wed, Apr 13, 2016 at 8:09 AM, Carlos Rojas Matas
wrote:
> Hi guys,
>
>
Hi guys,
I'm trying to enable logging in the executors, but with no luck.
According to the official documentation and several blogs, this should be
done by passing
"spark.executor.extraJavaOptions=-Dlog4j.configuration=[my-file]" to the
spark-submit tool. I've tried both sending a reference to a
Hi all,
If I am using a Custom Receiver with Storage Level set to
StorageLevel.MEMORY_ONLY_SER_2 and the WAL enabled, I get this warning in the logs:
16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: Storage level
replication 2 is unnecessary when write ahead log is enabled, change
to
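For reference, the receiver declaration in question looks roughly like this
(a sketch; the class name is made up):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // a custom receiver asking for serialized in-memory storage without replication
    class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY_SER) {
      def onStart(): Unit = { /* start a thread that calls store(...) */ }
      def onStop(): Unit = { }
    }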
Hi,
I am not sure if this helps.
We use Spark 1.6 and Hive 2. I also use JDBC (Beeline for Hive) plus
Oracle and Sybase. They all work fine.
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Hi,
We have a Hive table which gets data written to it by two partition keys,
day and hour.
We would like to stream the incoming files. Since fileStream can only
listen on one directory, we start a streaming job on the latest partition
and every hour kill it and start a new one on a newer
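What we do each hour looks roughly like this (a sketch; the path layout and
input types are assumptions, and day/hour are computed elsewhere):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // listen only on the newest day/hour partition directory
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      s"/warehouse/events/day=$day/hour=$hour").map(_._2.toString)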
Hi,
I am debugging a program, and for some reason, a line calling the following is
failing:
df.filter("sum(OpenAccounts) > 5").show
It says it cannot find the column OpenAccounts, as if it were applying the sum()
function and looking for a column named like that, which does not exist. This
Actually, the same error occurred when I ran build/sbt compile or other commands.
After struggling for some time, I remembered that I use a proxy to connect to the
Internet. After setting the proxy for Maven, everything seems OK. Just a reminder
for those who use proxies.
--
Best regards,
Ze Jin
The StateSpec and mapWithState APIs are only available in Spark 1.6.x.
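For reference, the 1.6.x API looks like this (a minimal sketch along the lines
of the linked example; wordDstream is assumed to be a DStream[(String, Int)]):

    import org.apache.spark.streaming.{State, StateSpec}

    // keep a running count per word across batches
    val spec = StateSpec.function(
      (word: String, one: Option[Int], state: State[Int]) => {
        val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
        state.update(sum)
        (word, sum)
      })
    val stateDstream = wordDstream.mapWithState(spec)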
2016-04-13 11:34 GMT+02:00 Marco Mistroni :
> hi all
> i am trying to replicate the Streaming Wordcount example described here
>
>
>
Hello:
I'm using the history server to keep track of the applications I run in my
cluster. I'm using Spark with YARN.
When I run an application it finishes correctly, and even YARN says that it
finished. This is the result from the YARN Resource Manager API:
{u'app': [{u'runningContainers': -1,
Hi,
I am trying to improve query performance, and I am not sure how to go about
this in Shark/Spark. Here is my problem.
When I execute a query it is run twice; here is a summary. First FileSink's
runJob runs, and next mapPartitions is executed.
1. Filesink uses only one job always is
Hi,
I stopped working with s3n a long time ago. If you are working
with Parquet and writing files, s3a is the only alternative that avoids failures.
Otherwise, why not just use s3://?
Regards,
Gourav
On Wed, Apr 13, 2016 at 12:17 PM, Steve Loughran
wrote:
>
> On 12
On 12 Apr 2016, at 22:05, Martin Eden wrote:
Hi everyone,
Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver, I manage
to process approx 1TB of strings inside gzipped Parquet in about 50 mins on a
20 node cluster (8
It looks like all of that is building up to Spark 2.0 (for random forests /
GBTs / etc.). Ah well... thanks for your help. It was interesting digging into
the depths.
Date: Wed, 13 Apr 2016 09:48:32 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC:
Hi all,
I am trying to replicate the streaming word count example described here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
In my build.sbt I have the following dependencies:
libraryDependencies +=
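(For reference, entries of roughly this shape are what such examples need; the
versions here are assumptions, and mapWithState requires 1.6.x as noted earlier
in the thread:)

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1"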
I wrote in "spark-defaults.conf" spark.driver.extraClassPath '/dir'
or "PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook"
/.../sparkling-water-1.6.1/bin/pysparkling \ --conf
spark.driver.extraClassPath='/.../sqljdbc41.jar'
Nothing works
Hi Ashic,
Unfortunately I don't know how to work around that - I suggested this line
as it looked promising (I had considered it once before deciding to use a
different algorithm) but I never actually tried it.
Regards,
James
On 13 April 2016 at 02:29, Ashic Mahtab wrote:
>
Are you running from Eclipse?
If so, add the HADOOP_CONF_DIR path to the classpath,
and then you can access your HDFS directory as below:
object sparkExample {
  def main(args: Array[String]) {
    val logname = "///user/hduser/input/sample.txt"
    val conf = new
Finally I tried setting the configuration manually using
sc.hadoopConfiguration.set for:
dfs.nameservices
dfs.ha.namenodes.hdpha
dfs.namenode.rpc-address.hdpha.n1
and it worked. I don't know why it was not reading these settings from the file
under HADOOP_CONF_DIR.
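In code, the settings look roughly like this (a sketch; the hostnames and port
are placeholders):

    // manual HDFS HA client configuration, mirroring hdfs-site.xml
    sc.hadoopConfiguration.set("dfs.nameservices", "hdpha")
    sc.hadoopConfiguration.set("dfs.ha.namenodes.hdpha", "n1,n2")
    sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hdpha.n1", "namenode1:8020")
    sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hdpha.n2", "namenode2:8020")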
-Original Message-
There are DNS entries for both of my namenodes.
Ambarimaster is standby and it resolves to its IP perfectly.
Hdp231 is active and it also resolves to its IP.
Hdpha is my Hadoop HA cluster name,
and hdfs-site.xml has entries related to this configuration.
-Original Message-
From: "Jörn Franke"
Is the host in /etc/hosts ?
> On 13 Apr 2016, at 07:28, Amit Singh Hora wrote:
>
> I am trying to access a directory in Hadoop from my Spark code on my local
> machine. Hadoop is HA enabled.
>
> val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]")
> val