Hello, here's a simple program that demonstrates my problem:
Is keyavg = rdd.values().reduce(sum) / rdd.count() inside stats calculated
once per partition, or just once overall? I guess another way to ask the
same question is: is DStream.transform() called on the driver node or not?
What would
It doesn't seem to be Parquet 1.7.0 since the package name isn't under
org.apache.parquet (1.7.0 is the first official Apache release of
Parquet). The version you were using is probably Parquet 1.6.0rc3
according to the line number information:
Yes! I was being dumb, should have caught that earlier, thank you Cheng Lian
On Fri, Aug 7, 2015 at 4:25 PM, Cheng Lian lian.cs@gmail.com wrote:
It doesn't seem to be Parquet 1.7.0 since the package name isn't under
org.apache.parquet (1.7.0 is the first official Apache release of
Hi,
The Spark UI and logs don't show the state of the cluster. However, you can
use Ganglia to monitor the cluster. In spark-ec2, there is an
option to install Ganglia automatically.
If you use CDH, you can also use Cloudera manager.
Cheers
Gen
On Sat, Aug 8, 2015 at 6:06 AM, Xiao
I am running in yarn-client mode and trying to execute network word count
example. When I connect through nc I see the following in spark app logs:
Exception in thread "main" java.lang.AssertionError: assertion failed: The
checkpoint directory has not been set. Please use
Hi,
In fact, PySpark uses
org.apache.spark.examples.pythonconverters (./examples/src/main/scala/org/apache/spark/pythonconverters/)
to transform HBase Result objects into Python strings.
Spark updated these two scripts recently. However, they are not included in
the official release of Spark. So you
Hi,
It depends on the problem that you work on.
Just as with Python and R, MLlib focuses on machine learning, and SparkR will
focus on statistics, if SparkR follows the way of R.
For instance, if you want to use glm to analyse data:
1. if you are interested only in the parameters of the model, and use this
Have you tried to do what it's suggesting?
If you want to learn more about checkpointing, you can see the programming
guide -
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
For more in depth understanding, you can see my talk -
You can use Spark Streaming's updateStateByKey to do arbitrary
sessionization.
See the example -
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
All it does is store a single number (count of each word seeing
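A minimal sketch of the updateStateByKey pattern used in the linked example (host, port, checkpoint path and batch interval are illustrative):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StatefulWordCount")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("checkpoint")   // updateStateByKey requires a checkpoint directory

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

// For each word, fold the counts of the current batch into the previous state.
val runningCounts = words.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)
}
runningCounts.print()
ssc.start()
ssc.awaitTermination()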
I have tens of millions of records, which are customer ID and city ID pairs.
There are tens of millions of unique customer IDs, and only a few hundred
unique city IDs. I want to do a merge to get all city IDs aggregated for a
specific customer ID, and pull back all records. I want to do this using
group
I'm trying to build spark 1.4.1 against CDH 5.3.2. I created a profile
called cdh5.3.2 in spark_parent.pom, made some changes for
sql/hive/v0.13.1, and the build finished successfully.
Here is my problem:
- If I run `mvn -Pcdh5.3.2,yarn,hive install`, the artifacts are
installed into my
Hi,
A silly question here. The Driver Web UI dies when the spark-submit program
finishes. I would like some time to analyze after the program ends, as the page
does not refresh itself; when I hit F5 I lose all the info.
Thanks,
Saif
Hello everyone,
this is my first message ever to a mailing list, so please pardon me if for
some reason I'm violating the etiquette.
I have a problem with rebroadcasting a variable. How it should work is not
well documented, so I could find only a few simple examples to understand
how it should
Hello,
Is there a way to estimate the approximate size of a dataframe? I know we
can cache it and look at the size in the UI, but I'm trying to do this
programmatically. With an RDD, I can sample and sum up the size using
SizeEstimator, then extrapolate it to the entire RDD. That will give me the
approx size of the RDD.
One possible solution is to spark-submit with --driver-class-path and list
all recursive dependencies. This is fragile and error prone.
Non-working alternatives (used in SparkSubmit.scala AFTER arguments parser
is initialized):
spark-submit --packages ...
spark-submit --jars ...
Offending commit is:
[SPARK-6014] [core] Revamp Spark shutdown hooks, fix shutdown races.
https://github.com/apache/spark/commit/e72c16e30d85cdc394d318b5551698885cfda9b8
Exactly!
The sharing part is used in the Spark Notebook (this one
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/Tachyon%20Test.snb)
so we can share stuff between notebooks that use different SparkContexts
(in different JVMs).
OTOH, we have a project that creates micro services
Hi,
The issue only seems to happen when trying to access spark via the SparkSQL
Thrift Server interface.
Does anyone know a fix?
james
From: Wu, Walt Disney james.c...@disney.com
Date: Friday, August 7, 2015 at 12:40 PM
To:
I'm having some difficulty getting the desired results from the Spark Python
example hbase_inputformat.py. I'm running with CDH5.4, HBase version 1.0.0,
Spark v 1.3.0, using Python version 2.6.6.
I followed the example to create a test HBase table. Here's the data from the
table I created:
In master branch, build/sbt-launch-lib.bash has the following:
URL1=
https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
I verified that the following exists:
Thanks Calvin - much appreciated !
-Abhishek-
On Aug 7, 2015, at 11:11 AM, Calvin Jia jia.cal...@gmail.com wrote:
Hi Abhishek,
Here's a production use case that may interest you:
http://www.meetup.com/Tachyon/events/222485713/
Baidu is using Tachyon to manage more than 100 nodes in
That's the correct URL. Recent change? The last time I looked, earlier this
week, it still had the obsolete artifactory URL for URL1 ;)
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler
Looks like Sean fixed it:
[SPARK-9633] [BUILD] SBT download locations outdated; need an update
Cheers
On Fri, Aug 7, 2015 at 3:22 PM, Dean Wampler deanwamp...@gmail.com wrote:
That's the correct URL. Recent change? The last time I looked, earlier
this week, it still had the obsolete
Hi,
I've been trying to track down some problems with Spark reads being very
slow with s3n:// URIs (NativeS3FileSystem). After some digging around, I
realized that this file system implementation fetches the entire file,
which isn't really a Spark problem, but it really slows down things when
Hi all,
I was running some Hive/Spark jobs on a Hadoop cluster. I want to see how Spark
helps improve not only the elapsed time but also the total CPU consumption.
For Hive, I can get the 'Total MapReduce CPU Time Spent' from the log when the
job finishes. But I didn't find any CPU stats for Spark
Hi all,
I have a partitioned parquet table (very small table with only 2
partitions). The version of Spark is 1.4.1, and the Parquet version is 1.7.0. I
applied this patch to Spark [SPARK-7743], so I assume that Spark can read
Parquet files normally; however, I'm getting this when trying to do a
simple
Yes, NullPointerExceptions are pretty common in Spark (or, rather, I seem
to encounter them a lot!) but can occur for a few different reasons. Could
you add some more detail, like what the schema is for the data, or the code
you're using to read it?
On Fri, Aug 7, 2015 at 3:20 PM, Jerrick Hoang
Hi, all Spark processes are saved in the Spark History Server.
Look at your host on port 18080 instead of 4040.
François
On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote:
Hi,
A silly question here. The Driver Web UI dies when the spark-submit
program finishes. I would like some time
Hello, thank you, but that port is unreachable for me. Can you please share
where I can find the equivalent port in my environment?
Thank you
Saif
From: François Pelletier [mailto:newslett...@francoispelletier.org]
Sent: Friday, August 07, 2015 4:38 PM
To: user@spark.apache.org
Subject: Re:
Hien,
Is Azkaban being phased out at LinkedIn as rumored? If so, what's LinkedIn
going to use for workflow scheduling? Is there something else that's going
to replace Azkaban?
On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu yuzhih...@gmail.com wrote:
In my opinion, choosing some particular project
There seems to be a version mismatch somewhere. You can try to find out
the cause with extended serialization debug information. I think the JVM flag
-Dsun.io.serialization.extendedDebugInfo=true should help.
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
Check out Reifier at
hi,
if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
doesn’t that make it a deterministic/classical gradient descent rather
than an SGD?
Specifically, miniBatchFraction=1.0 means the entire data set, i.e.
all rows. In the spirit of SGD, shouldn’t the default be the fraction
that
However, it's weird that the partition discovery job only spawns 2
tasks. It should use the default parallelism, which is probably 8
according to the logs of the next Parquet reading job. Partition
discovery is already done in a distributed manner via a Spark job. But
the parallelism is
HI,
I'm a new Spark user; these days I'm hitting a weird error in our
cluster.
I deployed spark-1.3.1 and CDH5 on my cluster; weeks ago I deployed NameNode
HA on it.
After that, my Spark job hits an error when I use the Java API, like this:
Is there any reason that the history server uses another property for the event
log dir? Thanks
Hi all,
I want to have some nesting structure from the existing columns of
the dataframe.
For that, I am trying to transform a DF in the following way, but couldn't
do it.
scala> df.printSchema
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
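Assuming Spark 1.4+, a sketch of one way to build the nesting directly (column names are taken from the schema above; the target layout is a guess):
import org.apache.spark.sql.functions.struct

// Fold a and b into a single struct column and keep c as-is.
val nested = df.select(struct(df("a"), df("b")).as("ab"), df("c"))
nested.printSchema()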
Is StringIndexer + VectorAssembler equivalent to HashingTF while converting
the document for analysis?
Hi Philip,
Thanks for providing the log file. It seems that most of the time is
spent on partition discovery. The code snippet you provided actually
issues two jobs. The first one is for listing the input directories to
find out all leaf directories (and this actually requires listing all
Hi,
I am running Spark on YARN on the CDH5.3.2 stack. I have created a new user
to own and run a testing environment; however, when using this user,
applications I submit to YARN never begin to run, even if they are the
exact same applications that succeed with another user.
Has anyone seen
They are actually the same thing, LongType. `long` is friendly for
developers, `bigint` is friendly for database folks and maybe data
scientists.
On Thu, Jul 23, 2015 at 11:33 PM, Sun, Rui rui@intel.com wrote:
printSchema calls StructField.buildFormattedString() to output schema
information.
Is there any workaround to distribute a non-serializable object for RDD
transformations or as a broadcast variable?
Say I have an object of class C which is not serializable. Class C is in a
jar package, and I have no control over it. Now I need to distribute it either by
RDD transformation or by broadcast.
You'll have a lot less hassle using the AWS EMR instances with Spark 1.4.1 for
now, until the spark_ec2.py scripts move to Hadoop 2.7.1; at the moment I'm
pretty sure they only use Hadoop 2.4.
The EMR setup with Spark lets you use s3:// URIs with IAM roles
Ewan
-Original Message-
I have ended up with the following piece of code, but it turns out to be
really slow... Any other ideas, provided that I can only use MLlib 1.2?
val data = test11.map(x => ((x(0), x(1)), x(2))).groupByKey().map(x =>
(x._1, x._2.toArray)).map{ x =>
var lt : Array[Double] = new
Hi all,
I am trying to figure out how to perform the equivalent of session windows (as
mentioned in https://cloud.google.com/dataflow/model/windowing) using spark
streaming. Is it even possible (i.e. possible to do efficiently at scale). Just
to expand on the definition:
Taken from the google
No, here's an example:
COL1 COL2
a one
b two
a two
c three
StringIndexer.setInputCol(COL1).setOutputCol(SI1) ->
(0 -> a, 1 -> b, 2 -> c)
SI1
0
1
0
2
StringIndexer.setInputCol(COL2).setOutputCol(SI2) ->
(0 -> one, 1 -> two, 2 -> three)
SI2
0
1
1
2
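A sketch of the mapping described above using the ml Pipeline API (the DataFrame construction is illustrative; note StringIndexer actually assigns indices by label frequency, so the exact numbers may differ from the hand-written table):
import org.apache.spark.ml.feature.StringIndexer

val df = sqlContext.createDataFrame(Seq(
  ("a", "one"), ("b", "two"), ("a", "two"), ("c", "three")
)).toDF("COL1", "COL2")

// Index each categorical column independently.
val si1 = new StringIndexer().setInputCol("COL1").setOutputCol("SI1").fit(df)
val si2 = new StringIndexer().setInputCol("COL2").setOutputCol("SI2").fit(df)
val indexed = si2.transform(si1.transform(df))
indexed.show()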
Hi, I was trying to use stronglyConnectedComponents().
Given a DAG as the graph, I was supposed to get back a list of strongly
connected components.
def main(args: Array[String])
{
val vertexArray = Array(
(1L, ("Alice", 28)),
(2L, ("Bob", 27)),
(3L, ("Charlie", 65)),
(4L, ("David", 42)),
(5L, ("Ed",
I am doing it by creating a new data frame out of the fields to be nested
and then joining it with the original DF.
Looking for some optimized solution here.
On Fri, Aug 7, 2015 at 2:06 PM, Rishabh Bhardwaj rbnex...@gmail.com wrote:
Hi all,
I want to have some nesting structure from the existing
Hi
I am new to Spark and I need to use the clustering functionality to
process a large dataset.
There are between 50k and 1 million objects to cluster. However, the problem
is that the optimal number of clusters is unknown. We cannot even
estimate a range, except that we know there are N objects.
Hi all ,
Does the DataFrame support the insert operation, like sqlContext.sql(insert
into table1 xxx select xxx from table2)?
guoqing0...@yahoo.com.hk
If the object cannot be serialized, then I don't think broadcast will make
it magically serializable. You can't transfer data structures between nodes
without serializing them somehow.
On Fri, Aug 7, 2015 at 7:31 AM, Sujit Pal sujitatgt...@gmail.com wrote:
Hi Hao,
I think sc.broadcast will
If the object is something like a utility object (say a DB connection
handler), I often use:
@transient lazy val someObj = MyFactory.getObj(...)
So basically `@transient` tells the closure cleaner not to serialize this,
and the `lazy val` allows it to be initiated on each executor upon its
This blog outlines a few things that make Spark faster than MapReduce -
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
On Fri, Aug 7, 2015 at 9:13 AM, Muler mulugeta.abe...@gmail.com wrote:
Consider the classic word count application over a 4 node cluster with a
sizable
See this post for a detailed explanation of your problem:
http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html
Is this the sort of problem spark can accommodate?
I need to compare 10,000 matrices with each other (10^10 comparison). The
matrices are 100x10 (10^7 int values).
I have 10 machines with 2 to 8 cores (8-32 processors).
All machines have to
- contribute to matrices generation (a
1) Spark only needs to shuffle when data needs to be partitioned around the
workers in an all-to-all fashion.
2) Multi-stage jobs that would normally require several MapReduce jobs,
with data dumped to disk between the jobs, can instead keep intermediate
results cached in memory.
Hi,
I'm looking for open source workflow tools/engines that allow us to
schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I
wanted to check with people here to see what they are using today.
Some of the
Hao,
I’d say there are a few possible ways to achieve that:
1. Use KryoSerializer.
The flaw of KryoSerializer is that the current version (2.21) has an issue with
internal state and it might not work for some objects. Spark gets the kryo
dependency transitively through chill and it’ll not be resolved
Looks like Oozie can satisfy most of your requirements.
On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone vikramk...@gmail.com wrote:
Hi,
I'm looking for open source workflow tools/engines that allow us to
schedule Spark jobs on a DataStax Cassandra cluster. Since there are tonnes
of
Consider the classic word count application over a 4 node cluster with a
sizable working data. What makes Spark run faster than MapReduce
considering that Spark also has to write to disk during shuffle?
In general the simplest way is that you can use the Dynamo Java API as is and
call it inside a map(), and use the asynchronous put() Dynamo API call.
On Aug 7, 2015, at 9:08 AM, Yasemin Kaya godo...@gmail.com wrote:
Hi,
Is there a way using DynamoDB in spark application? I have to
Thanks, I also confirmed that the partition discovery is slow by writing a
non-Spark application that uses the parquet library directly to load the
partitions.
It's so slow that my colleague's Python application can read the entire
contents of all the parquet data files faster than my
Thanks for the suggestion Hien. I'm curious why not Azkaban from LinkedIn.
From what I read online, Oozie was very cumbersome to set up and use compared
to Azkaban. Since you are from LinkedIn, I wanted to get some perspective on
what it lacks compared to Oozie. Ease of use is very important, more than
Simone, here are some thoughts. Please check out the understanding closures
section of the Spark Programming Guide. Secondly, broadcast variables do not
propagate updates to the underlying data. You must either create a new
broadcast variable or alternately if you simply wish to accumulate
Sounds reasonable to me, feel free to create a JIRA (and PR if you're up
for it) so we can see what others think!
On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler
gerald.loeff...@googlemail.com wrote:
hi,
if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
doesn’t that make it
Yep, I think that's what Gerald is saying and they are proposing to default
miniBatchFraction = (1 / numInstances). Is that correct?
On Fri, Aug 7, 2015 at 11:16 AM, Meihua Wu rotationsymmetr...@gmail.com
wrote:
I think in the SGD algorithm, the mini batch sample is done without
replacement.
Hi, Spark users:
We are currently using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production
cluster, which has 42 data/task nodes.
There is one dataset stored as Avro files, about 3 TB. Our business has a complex
query running against the dataset, which is stored in a nested structure with an Array of
SparkR and MLlib are becoming more integrated (we recently added R formula
support) but the integration is still quite small. If you learn R and
SparkR, you will not be able to leverage most of the distributed algorithms
in MLlib (e.g. all the algorithms you cited). However, you could use the
Spark is an in-memory engine and attempts to do computation in-memory.
Tachyon is memory-centric distributed storage, OK, but how would that help
run Spark faster?
Hi,
I'm trying to run the hive thrift server in debug mode. I've tried to simply
pass -Xdebug
-Xrunjdwp:transport=dt_socket,address=127.0.0.1:,server=y,suspend=n to
start-thriftserver.sh as a driver option, but it doesn't seem to host a server.
I've then tried to edit the various shell
Spark 1.4.1 depends on:
<akka.version>2.3.4-spark</akka.version>
Is it possible that your standalone cluster has another version of akka ?
Cheers
On Fri, Aug 7, 2015 at 10:48 AM, Jeff Jones jjo...@adaptivebiotech.com
wrote:
Thanks. Added this to both the client and the master but still not
Check on which IP/port the master listens:
netstat -a -t --numeric-ports
On 7 August 2015 at 20:48, Jeff Jones jjo...@adaptivebiotech.com wrote:
Thanks. Added this to both the client and the master but still not getting
any more information. I confirmed the flag with ps.
jjones53222 2.7
Hey all.
I was trying to understand Spark internals by looking into (and hacking)
the code.
I was trying to explore the buckets which are generated
when we partition the output of each map task and then let the reduce side
fetch them on the basis of partitionId. I went into the write() method of
Hi, Michael:
I am not sure how spark-avro can help in this case.
My understanding is that to use Spark-avro, I have to translate all the logic
from this big Hive query into Spark code, right?
If I have this big Hive query, how I can use spark-avro to run the query?
Thanks
Yong
From:
Looks like you would get a better response on Tachyon's mailing list:
https://groups.google.com/forum/?fromgroups#!forum/tachyon-users
Cheers
On Fri, Aug 7, 2015 at 9:56 AM, Abhishek R. Singh
abhis...@tetrationanalytics.com wrote:
Do people use Tachyon in production, or is it experimental
Please community, I'd really appreciate your opinion on this topic.
Best regards,
Roberto
-- Forwarded message --
From: Roberto Coluccio roberto.coluc...@gmail.com
Date: Sat, Jul 25, 2015 at 6:28 PM
Subject: [Spark + Hive + EMR + S3] Issue when reading from Hive external
table
Check also Falcon in combination with Oozie.
On Fri, Aug 7, 2015 at 17:51, Hien Luu h...@linkedin.com.invalid wrote:
Looks like Oozie can satisfy most of your requirements.
On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone vikramk...@gmail.com wrote:
Hi,
I'm looking for open source workflow
Hi,
I am using Spark SQL to run some queries on a set of avro data. Somehow I am
getting this error
0: jdbc:hive2://n7-z01-0a2a1453 select count(*) from flume_test;
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task
3 in stage 26.0 failed 4 times, most recent
Do people use Tachyon in production, or is it experimental grade still?
Regards,
Abhishek
From what I heard (an ex-coworker who is Oozie committer), Azkaban is being
phased out at LinkedIn because of scalability issues (though UI-wise,
Azkaban seems better).
Vikram:
I suggest you do more research in related projects (maybe using their
mailing lists).
Disclaimer: I don't work for
You can register your data as a table using this library and then query it
using HiveQL
CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "src/test/resources/episodes.avro")
On Fri, Aug 7, 2015 at 11:42 AM, java8964 java8...@hotmail.com wrote:
Hi, Michael:
I am not
Thanks. Added this to both the client and the master but still not getting any
more information. I confirmed the flag with ps.
jjones53222 2.7 0.1 19399412 549656 pts/3 Sl 17:17 0:44
/opt/jdk1.8/bin/java -cp
Hi,
Tachyon http://tachyon-project.org manages memory off heap which can help
prevent long GC pauses. Also, using Tachyon will allow the data to be
shared between Spark jobs if they use the same dataset.
Here's http://www.meetup.com/Tachyon/events/222485713/ a production use
case where Baidu
Oh ok. That's a good enough reason against Azkaban then. So looks like
Oozie is the best choice here.
On Friday, August 7, 2015, Ted Yu yuzhih...@gmail.com wrote:
From what I heard (an ex-coworker who is Oozie committer), Azkaban is
being phased out at LinkedIn because of scalability issues
I think in the SGD algorithm, the mini batch sample is done without
replacement. So with fraction=1, all the rows will be sampled
exactly once to form the miniBatch, resulting in the
deterministic/classical case.
On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang fli...@databricks.com wrote:
Thanx Jay.
2015-08-07 19:25 GMT+03:00 Jay Vyas jayunit100.apa...@gmail.com:
In general the simplest way is that you can use the Dynamo Java API as is
and call it inside a map(), and use the asynchronous put() Dynamo api call
.
On Aug 7, 2015, at 9:08 AM, Yasemin Kaya godo...@gmail.com
Good to know that.
Let me research it and give it a try.
Thanks
Yong
From: mich...@databricks.com
Date: Fri, 7 Aug 2015 11:44:48 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org
You can register your data as a table using this library and then query
Sent from my Sony Xperia™ smartphone
iceback wrote:
Is this the sort of problem spark can accommodate?
I need to compare 10,000 matrices with each other (10^10 comparison). The
matrices are 100x10 (10^7 int values).
I have 10 machines with 2 to 8 cores (8-32
Sent from my Sony Xperia™ smartphone
saif.a.ell...@wellsfargo.com wrote:
Good point; I agree that defaulting to online SGD (single example per
iteration) would be a poor UX due to performance.
On Fri, Aug 7, 2015 at 12:44 PM, Meihua Wu rotationsymmetr...@gmail.com
wrote:
Feynman, thanks for clarifying.
If we default miniBatchFraction = (1 / numInstances), then we
Hi,
I got into a situation where a prior add jar command causes Spark SQL to stop
working for all users.
Does anyone know how to fix the issue?
Regards,
james
From: Wu, Walt Disney james.c...@disney.com
Date: Friday, August 7, 2015 at 10:29 AM
To:
Feynman, thanks for clarifying.
If we default miniBatchFraction = (1 / numInstances), then we will
only hit one row for every iteration of SGD regardless of the number of
partitions and executors. In other words, the parallelism provided by
the RDD is lost in this approach. I think this is something
Sent from my Sony Xperia™ smartphone
Meihua Wu wrote:
Feynman, thanks for clarifying.
If we default miniBatchFraction = (1 / numInstances), then we will
only hit one row for every iteration of SGD regardless of the number of
partitions and executors. In other words the
Look at
spark.history.ui.port if you use standalone, or
spark.yarn.historyServer.address if you use YARN,
in your Spark config file.
Mine is located at
/etc/spark/conf/spark-defaults.conf
If you use Apache Ambari you can find this settings in the Spark /
Configs / Advanced spark-defaults tab
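For reference, a sketch of the relevant entries in spark-defaults.conf (the log directory is an example path):
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///user/spark/applicationHistory
spark.history.fs.logDirectory    hdfs:///user/spark/applicationHistory
spark.history.ui.port            18080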
I recently downloaded Spark package 1.4.0:
A build of Spark with sbt/sbt clean assembly failed with the message Error:
Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar.
Upon investigation I figured out that sbt-launch-0.13.7.jar is downloaded
at build time and that it contained the the
I'm interested in machine learning on time series.
In our environment we have a lot of metric data continuously coming from
agents. The data are stored in Cassandra. Is it possible to set up Spark so that it
would use machine learning on previous data and new incoming data?
Hi,
Is there a way using DynamoDB in spark application? I have to persist my
results to DynamoDB.
Thanx,
yasemin
--
hiç ender hiç
Hi
The graph returned by SCC (strong_graphs in your code) has vertex data where
each vertex in a component is assigned the lowest vertex id of the
component. So if you have 6 vertices (1 to 6) and 2 strongly connected
components (1 and 3, and 2,4,5 and 6) then the strongly connected components
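A small sketch of that example in GraphX (the edge set is made up to produce the two components {1,3} and {2,4,5,6}):
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize((1L to 6L).map(id => (id, id)))
val edges = sc.parallelize(Seq(
  Edge(1L, 3L, 1), Edge(3L, 1L, 1),                                    // 1 <-> 3
  Edge(2L, 4L, 1), Edge(4L, 5L, 1), Edge(5L, 6L, 1), Edge(6L, 2L, 1)   // 2 -> 4 -> 5 -> 6 -> 2
))
val scc = Graph(vertices, edges).stronglyConnectedComponents(numIter = 10)

// Each vertex ends up labelled with the lowest vertex id in its component:
// (1,1), (2,2), (3,1), (4,2), (5,2), (6,2)
scc.vertices.collect().sorted.foreach(println)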
Hi all,
I am getting an exception when trying to execute a Spark Job that is using
the new Phoenix 4.5 spark connector. The application works very well in my
local machine, but fails to run in a cluster environment on top of yarn.
The cluster is a Cloudera CDH 5.4.4 with HBase 1.0.0 and Phoenix
Looking at the call stack and the diffs between 1.3.1 and 1.4.1-rc4, I see
something that could be relevant to the issue.
1) The call stack shows that the log4j manager gets initialized and uses the default Java
context class loader. This context class loader should probably be
MutableURLClassLoader from Spark, but