So after doing some more research, I found the root cause of the problem. The
bucket name we were using contained an underscore ('_'), which goes against the
current requirements for naming buckets. Using a bucket name without an
underscore solved the issue.
If anyone else runs into this
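In case it is useful, here is a quick sanity check one could run on a bucket name before using it. This is just a sketch of mine, and the character rules encoded below (lowercase letters, digits, dots and hyphens only, 3-63 characters) are my assumption of the usual DNS-compliant naming requirements, not something stated in this thread.
// Rough sanity check for DNS-compliant bucket names (rules assumed; verify
// against your storage provider's documentation).
def isValidBucketName(name: String): Boolean =
  name.length >= 3 && name.length <= 63 &&
    !name.contains("_") &&
    name.matches("[a-z0-9][a-z0-9.-]*[a-z0-9]")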
The code that fails to run is as follows:
../bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--verbose \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
../lib/spark-examples*.jar \
100
Exception in thread
Hi Sourav,
I will take a look at that too. Thanks a lot for your help.
Greetings,
Juan
2014-07-30 10:58 GMT+02:00 Sourav Chandra sourav.chan...@livestream.com:
While running the application, set this:
-Dsun.io.serialization.extendedDebugInfo=true
This is applicable to Java versions after 1.6.
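For illustration, one way to pass that flag through to a Spark application is via the extra Java options settings. This is a sketch of my own (not from the message above) and assumes Spark 1.0+, where these configuration keys exist; for the driver JVM the flag usually has to be supplied before launch, for example with spark-submit's --driver-java-options.
import org.apache.spark.{SparkConf, SparkContext}

// Enable extended serialization debug info on the executors.
val conf = new SparkConf()
  .setAppName("SerializationDebug")
  .set("spark.executor.extraJavaOptions",
       "-Dsun.io.serialization.extendedDebugInfo=true")
val sc = new SparkContext(conf)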
On
You may be interested in https://github.com/OryxProject/oryx which is
at heart exactly lambda architecture on Spark Streaming. With ML
pipelines on top. The architecture diagram and a peek at the code may
give you a good example of how this could be implemented. I choose to
view the batch layer as
Did I not describe the problem clearly? Is anyone familiar with softmax
regression?
Thanks.
Cui xp.
You are running a continuous compilation. AFAIK, it runs in an infinite loop
and will compile only the modified files. For compiling with Maven, have a
look at these steps:
https://spark.apache.org/docs/latest/building-with-maven.html
Thanks,
Visakh
Thanks for your reply.
I think that the problem was that SparkR tried to serialize the whole
environment. Note that the large dataframe was part of it. So every
worker received its slice / partition (which is very small) plus the
whole thing!
So I deleted the large dataframe and list before
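The same pitfall exists in the Scala API. As a small sketch of my own (not the original SparkR code, and assuming an existing SparkContext sc), here is a closure that drags a large driver-side object along with every task, and the broadcast-variable alternative that avoids it:
// A large driver-side value referenced in a closure is serialized with each task.
val bigLookup: Array[Int] = Array.fill(10000000)(1)
val rdd = sc.parallelize(0 until 100)
val bad = rdd.map(i => bigLookup(i))        // ships bigLookup inside the closure

// Broadcasting sends the value to each executor once and keeps the closure small.
val bigBc = sc.broadcast(bigLookup)
val better = rdd.map(i => bigBc.value(i))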
I've seen a couple of issues posted about this, but I never saw a resolution.
When I'm using Spark 1.0.2 (and the spark-submit script to submit my jobs) and
AWS SDK 1.8.7, I get the stack trace below. However, if I drop back to AWS SDK
1.3.26 (or anything from the AWS SDK 1.4.* family) then
Apologies, but we had restricted the slide download to Seattle Spark Meetup
members only, when we actually meant to share it with everyone. We have since
fixed this and now you can download it. HTH!
On August 14, 2014 at 18:14:35, Denny Lee (denny.g@gmail.com) wrote:
For
Hi Andrew,
I'm running something close to the current master (I compiled several days
ago) but am having some trouble viewing history.
I set spark.eventLog.dir to true, but continually receive the error
message (via the web UI) Application history not found...No event logs
found for application
I've been using the standalone cluster all this time and it worked fine.
Recently I have been using another Spark cluster that is based on YARN, and I
have no experience with YARN.
The YARN cluster has 10 nodes and a total memory of 480G.
I'm having trouble starting the spark-shell with enough memory.
I'm
Hi,
Ok, I was specifying --master local. I changed that to --master
spark://localhostname:7077 and am now able to see the completed
applications. It provides summary stats about runtime and memory usage,
which is sufficient for me at this time.
However it doesn't seem to archive the info in
Hi Soumya,
The driver's console output prints out how much memory is actually granted
to each executor, so from there you can verify how much memory the
executors are actually getting. You should use the '--executor-memory'
argument in spark-shell. For instance, assuming each node has 48G of
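To make the flag concrete, here is a hypothetical illustration of my own (the 4g figure is made up, not taken from the message above); the same value can be requested either on the command line or programmatically:
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to launching with: spark-shell --executor-memory 4g
val conf = new SparkConf()
  .setAppName("MemoryExample")
  .set("spark.executor.memory", "4g")   // must fit within YARN's per-container limit
val sc = new SparkContext(conf)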
Hi 齐忠,
Thanks for reporting this. You're correct that the default deploy mode is
client. However, this seems to be a bug in the YARN integration code; we
should not throw a null pointer exception in any case. What version of Spark
are you using?
Andrew
2014-08-15 0:23 GMT-07:00 centerqi hu
Hi guys,
We have a use case where we need to use consecutive data points to predict
the status (yes, like using time-series data to predict machine failure). Is
there a straightforward way to do this in Spark Streaming?
If all consecutive data points are in one batch, it's not complicated
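One straightforward option, sketched below purely as my own illustration (the socket source, the durations, and the placeholder model call are all assumptions, and an existing SparkContext sc is assumed), is a sliding window so that each computation sees the last several batches of consecutive points:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val points = ssc.socketTextStream("sensor-host", 9999).map(_.toDouble)  // hypothetical source

// A 60-second window sliding every 10 seconds keeps consecutive points together
// even when they span batch boundaries.
points.window(Seconds(60), Seconds(10)).foreachRDD { rdd =>
  val recent = rdd.collect()   // the window is small, so collecting to the driver is cheap
  // feed `recent` to the prediction model here
}

ssc.start()
ssc.awaitTermination()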
Hi Cui
You can take a look at the multinomial logistic regression PR I created.
https://github.com/apache/spark/pull/1379
Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
I just checked the YARN config and it looks like I need to change this value.
Should it be upgraded to 48G (the max memory allocated to YARN) per node?
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
  <source>java.io.BufferedInputStream@2e7e1ee</source>
</property>
On Fri, Aug 15,
We generally recommend setting yarn.scheduler.maximum-allocation-mb to the
maximum node capacity.
-Sandy
On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:
I just checked the YARN config and it looks like I need to change this value.
Should it be upgraded to 48G (the
Is it practical to maintain a hardware context on each of the worker hosts in
Spark? In my particular problem I have an OpenCL (or JavaCL) context which has
two things associated with it:
- Data stored on a GPU
- Code compiled for the GPU
If the context goes away, the data is lost and the
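One pattern that is often suggested for this kind of problem, sketched below purely as my own illustration (GpuContext is a made-up stand-in for the real OpenCL/JavaCL type, and sc is an existing SparkContext), is to hold the context in a JVM-wide singleton so each executor builds it once and reuses it across tasks instead of receiving it from the driver:
// Hypothetical stand-in for the real OpenCL/JavaCL context.
class GpuContext { def process(x: Int): Int = x }
object GpuContext { def create(): GpuContext = new GpuContext }

// Initialized lazily inside each executor JVM; never serialized from the driver.
object GpuContextHolder {
  lazy val context: GpuContext = GpuContext.create()
}

val rdd = sc.parallelize(1 to 1000)
val result = rdd.mapPartitions { iter =>
  val ctx = GpuContextHolder.context   // same instance for every task in this JVM
  iter.map(ctx.process)
}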
@Brad
Your configuration looks alright to me. We parse both file:/ and
file:/// the same way so that shouldn't matter. I just tried this on the
latest master and verified that it works for me. Can you dig into the
directory /tmp/spark-events/ml-pipeline-1408117588599 to make sure that
it's not
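For anyone following along, a minimal event-log setup looks roughly like the sketch below. It is my own illustration of the standard settings, not Brad's actual configuration:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("HistoryExample")
  .set("spark.eventLog.enabled", "true")                  // turn event logging on
  .set("spark.eventLog.dir", "file:///tmp/spark-events")  // directory the UI reads history from
val sc = new SparkContext(conf)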
Folks,
I wrote the following wrapper on top on combineByKey. The RDD is of
Array[Any] and I am extracting a field at a given index for combining.
There are two ways in which I tried this:
Option A: leave colIndex abstract in the Aggregator class and define it in the
derived object Aggtor with value -1. It
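Since the wrapper itself is not included above, here is a minimal sketch of my own of the general shape; the key column, the numeric value column, and the summing combiner are all assumptions chosen for illustration:
import org.apache.spark.SparkContext._   // brings in the pair-RDD functions (Spark 1.x)
import org.apache.spark.rdd.RDD

// Combine rows of Array[Any] by one column, summing a numeric column chosen by index.
def sumByColumn(rdd: RDD[Array[Any]], keyIndex: Int, colIndex: Int): RDD[(Any, Double)] =
  rdd.map(row => (row(keyIndex), row(colIndex).toString.toDouble))
     .combineByKey(
       (v: Double) => v,                      // createCombiner
       (acc: Double, v: Double) => acc + v,   // mergeValue
       (a: Double, b: Double) => a + b)       // mergeCombiners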
DB,
Did you compare softmax regression with one-vs-all and find that softmax
is better?
One-vs-all can be implemented as a wrapper over the binary classifiers that we
have in MLlib... I am curious whether softmax multinomial is better in most
cases or whether it is worthwhile to add a one-vs-all version of MLOR.
Hi Debasish,
I didn't try one-vs-all vs softmax regression. One issue is that for
one-vs-all, we have to train k classifiers for a k-class problem. The
training time will be k times longer.
Sincerely,
DB Tsai
---
My Blog:
Hi,
Are there any experiments detailing the performance hit due to HDFS
checkpointing in ALS?
As we scale to large ranks with more ratings, I believe we have to cut the
RDD lineage to safeguard against the lineage issue...
Thanks.
Deb
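For reference, lineage is usually cut with RDD checkpointing. Here is a bare-bones sketch of my own (the checkpoint path and the RDD are placeholders, and sc is an existing SparkContext):
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // hypothetical HDFS path

val iterated = sc.parallelize(1 to 1000).map(_ * 2)    // stands in for an iteratively rebuilt RDD
iterated.checkpoint()   // truncates the lineage; data is materialized on the next action
iterated.count()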
After changing the allocation I'm getting the following in my logs. No idea
what this means.
14/08/15 15:44:33 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: -1
appStartTime: 1408131861372
yarnAppState: ACCEPTED
14/08/15 15:44:34 INFO
Sandy and others:
Is there a single source of Yarn/Hadoop properties that should be
set or reset for running Spark on Yarn?
We've sort of stumbled through one property after another, and
(unless there's an update I've not yet seen) CDH5 Spark-related
properties
Hi All,
I am just trying to save the Kafka DStream to Hadoop as follows:
val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
dStream.saveAsHadoopFiles(hdfsDataUrl, data)
It throws the following exception. What am I doing wrong?
14/08/15 14:30:09 ERROR
Hi Spark community,
At Adobe Research, we're happy to open source a prototype
technology called Spindle that we've been developing over
the past few months for processing analytics queries with Spark.
Please take a look at the repository on GitHub at
https://github.com/adobe-research/spindle,
and we
Somewhere, your function has a reference to the Hadoop JobConf object
and is trying to send that to the workers. It's not in the code you
pasted, so it must be from something slightly different?
It shouldn't need to send that around and in fact it can't be
serialized as you see. If you need a Hadoop
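A common way out, shown here only as my own sketch rather than Sean's exact suggestion, is to keep the JobConf out of the serialized closure, for example by marking it @transient so it is rebuilt on the workers instead of being shipped from the driver:
import org.apache.hadoop.mapred.JobConf

class HadoopWriter extends Serializable {
  // Rebuilt lazily in each executor JVM; never serialized with the closure.
  @transient private lazy val jobConf: JobConf = new JobConf()

  def write(record: String): Unit = {
    val conf = jobConf
    // ... write `record` using `conf`
  }
}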
Guoqiang reported some results in his PRs
https://github.com/apache/spark/pull/828 and
https://github.com/apache/spark/pull/929 . But this is really
problem-dependent. -Xiangrui
On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das debasish.da...@gmail.com wrote:
Hi,
Are there any experiments
Look, this is the whole program. I am not trying to serialize the JobConf.
def main(args: Array[String]) {
  try {
    val properties = getProperties(settings.properties)
    StreamingExamples.setStreamingLogLevels()
    val zkQuorum = properties.get(zookeeper.list).toString()
    val
Hi All,
I have an MLlib model:
val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth)
I see the model has the following methods: algo, asInstanceOf, isInstanceOf,
predict, toString, topNode
model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0,
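In case it helps, prediction with such a model looks roughly like the lines below; this is a sketch of my own with a made-up feature vector, not something taken from the original message:
import org.apache.spark.mllib.linalg.Vectors

val prediction = model.predict(Vectors.dense(1.0, 2.0, 3.0))   // hypothetical feature vector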
Since SQLContext supports less SQL than Hive (if I understand correctly), I
plan to run more queries through hql. However, is it possible to create some
tables as Parquet in hql? What kind of commands should I use? Thanks in
advance for any information.
I just discovered that the Distinct call is working as expected when I run a
driver through spark-submit. This is only an issue in the REPL environment.
Very strange...
I have a bunch of JSON files stored in HDFS that I want to read in, modify,
and write back out. I'm new to all this and am not sure if this is even the
right thing to do.
Basically, my JSON files contain my raw data, and I want to calculate some
derived data and add it to the existing data.
So
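For what it is worth, the basic read-modify-write cycle looks roughly like the sketch below. This is my own illustration with placeholder paths, assuming one JSON record per line and an existing SparkContext sc; the JSON parsing itself is left as a stub:
val raw = sc.textFile("hdfs:///data/raw/*.json")        // placeholder input path
val enriched = raw.map { line =>
  // parse `line` with a JSON library, add the derived fields, and re-serialize;
  // shown here as an identity placeholder
  line
}
enriched.saveAsTextFile("hdfs:///data/enriched")        // placeholder output path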
Yes, you can write to Parquet tables. On Spark 1.0.2 all I had to do was
include the parquet-hive-bundle-1.5.0.jar on my classpath.
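For concreteness, with that bundle on the classpath a Parquet-backed table can be declared from the HiveContext roughly as follows. This is my own sketch, and the SerDe and input/output format class names are the ones shipped with the parquet-hive bundle as I recall them, so verify them against the jar you actually use:
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.hql("""
  CREATE TABLE events_parquet (id INT, name STRING)
  ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
""")
hiveContext.hql("INSERT OVERWRITE TABLE events_parquet SELECT id, name FROM some_existing_table")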
From: lyc <yanchen@huawei.com>
Sent: Friday, August 15, 2014 7:30 PM
To: u...@spark.incubator.apache.org
If I reduce the app to the following code then I don't see the exception. It
creates the Hadoop files, but they are empty! The DStream doesn't get written
out to the files!
def main(args: Array[String]) {
  try {
    val properties = getProperties(settings.properties)
My use case is as mentioned below.
1. Read input data from the local file system using sparkContext.textFile(input
path).
2. Partition the input data (80 million records) into partitions using
RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer
function. Without using coalesce() or
On Wed, Aug 6, 2014 at 11:37 AM, AlexanderRiggers
alexander.rigg...@gmail.com wrote:
To perform the PageRank I have to create a graph object, adding the edges
by setting sourceID=id and distID=brand. In GraphLab there is a function: g =
SGraph().add_edges(data, src_field='id',
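The closest GraphX equivalent I can think of, sketched here as my own illustration (the toy pairs and the assumption that ids and brands are already numeric vertex ids are mine, and sc is an existing SparkContext), is to build an edge RDD and construct the graph from it:
import org.apache.spark.graphx.{Edge, Graph}

// Toy (id, brand) pairs already mapped to Long vertex ids.
val pairs = sc.parallelize(Seq((1L, 100L), (2L, 100L), (3L, 101L)))
val edges = pairs.map { case (id, brand) => Edge(id, brand, 1) }

val graph = Graph.fromEdges(edges, defaultValue = 1)
val ranks = graph.pageRank(0.0001).vertices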
Thank you for your reply.
Do you know where I can find some detailed information about how to use
Parquet in HiveContext?
Any information is appreciated.
I am getting the following error while doing SPARK_HADOOP_VERSION=2.3.0
sbt/sbt/package
java.io.IOException: Cannot run program
/home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No
such file or directory
...lots of errors
[error] (core/compile:compile)