You may need to add the -Phadoop-2.4 profile. When building our release
packages for Hadoop 2.4 we use the following flags:
-Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn
- Patrick
On Thu, Mar 5, 2015 at 12:47 PM, Kelly, Jonathan jonat...@amazon.com wrote:
I confirmed that this has nothing to
It seems from the excerpt below that your cluster is set up to use the
Yarn ATS, and the code is failing in that path. I think you'll need to
apply the following patch to your Spark sources if you want this to
work:
https://github.com/apache/spark/pull/3938
On Thu, Mar 5, 2015 at 10:04 AM, Todd
That's probably a good thing to have, so I'll add it, but unfortunately it
did not help this issue. It looks like the hadoop-2.4 profile only sets
these properties, which don't seem like they would affect anything related
to Netty:
<properties>
  <hadoop.version>2.4.0</hadoop.version>
I confirmed that this has nothing to do with BigTop by running the same mvn
command directly in a fresh clone of the Spark package at the v1.2.1 tag. I
got the same exact error.
Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2
From: Kelly, Jonathan Kelly
I am testing the Random Forest in Spark, but I have a question... If I train
for a second time, will it update the decision trees already created, or are
they created anew? That is, will the system keep learning from each
dataset, or only from the first?
Thanks for everything
Hi Mitch,
I think it is normal. The network utilization will be high when there is
some shuffling happening. After that, the network utilization should come
down, while each slave node does the computation on the partitions assigned
to it. At least that is my understanding.
Best,
Hi, Ted,
Thanks for your reply. I noticed from the link below that partitions.size will
not work for checking whether an RDD is empty in streams. It seems the problem
is solved in Spark 1.3, which there is no way to download at this time?
https://issues.apache.org/jira/browse/SPARK-5270
Best regards,
Cui Lin
I've downloaded spark 1.2.0 to my laptop. In the lib directory, it includes
spark-assembly-1.2.0-hadoop2.4.0.jar
When I spin up a cluster using the ec2 scripts with 1.2.0 (and set
--hadoop-major-version=2) I notice that in the lib directory for the
master/slaves the assembly is for
Hi David,
It is a great point. It is actually one of the reasons that my program is
slow. I found that the major cause of my program running slowly is the huge
garbage collection time. I created too many small objects in the map
procedure, which triggers GC frequently. After I improved my
Hint: print() just gives a sample of what is in the data, and does not
force processing of all the data (only the first partition of the RDD is
computed to get 10 items). count() actually processes all the data. This
is all due to lazy evaluation: if you don't need to use all the data, don't
process it.
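A quick sketch of the difference (rdd here stands for whatever batch RDD you are inspecting):
val sample = rdd.take(10)   // computes only as many partitions as needed to return 10 elements
val total  = rdd.count()    // forces every partition to be computed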
Thanks.
I was already setting those (and I checked they were in use through the
environment tab in the UI).
They were set at 10 times their default value: 6 and 1 respectively.
I'll start poking at spark.shuffle.io.retryWait.
Thanks!
On Wed, Mar 4, 2015 at 7:02 PM, Ted Yu
Please ignore my question; you can simply specify the root directory and it
looks like Redshift takes care of the rest:
copy mobile
from 's3://BUCKET_NAME/'
credentials
json 's3://BUCKET_NAME/jsonpaths.json'
On Thu, Mar 5, 2015 at 3:33 PM, Mike Trienis mike.trie...@orcsol.com
wrote:
Hi
Hello,
Getting started with Spark.
Got JavaNetworkWordCount working on a 3-node cluster, with netcat in an
infinite loop printing random numbers 0-100.
With a duration of 1 sec, I do see a list of (word, count) values every
second. The list is limited to 10 values (as per the docs).
The count
In addition, you may need the following patch, if it is not already in 1.2.1,
to solve a system property issue when you use HDP 2.2.
https://github.com/apache/spark/pull/3409
You can follow the following link to set hdp.version for java options.
PS: I would recommend compressing the data when you cache the RDD.
There will be some overhead from compression/decompression and
serialization/deserialization, but it will help a lot for iterative
algorithms by letting you cache more data.
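A sketch of one way to do that, using serialized caching plus RDD compression (rdd stands for whatever dataset you are caching):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().set("spark.rdd.compress", "true")  // compress serialized cached blocks
val sc = new SparkContext(conf)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)                     // cache in serialized (and now compressed) form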
Sincerely,
DB Tsai
I meant from one app, yes.
I was asking this because our previous tuning experiments show spark-on-yarn
runs faster when overloading workers with executors (i.e. if a worker has 4
cores, creating 2 executors that each use 4 cores gives a speed boost over 1
executor with 4 cores).
I have found an
Hey guys,
Trying to build Spark 1.3 for Scala 2.11.
I'm running with the following Maven command;
-DskipTests -Dscala-2.11 clean install package
*Exception*:
[ERROR] Failed to execute goal on project spark-core_2.10: Could not
resolve dependencies for project
Hi,
On Thu, Mar 5, 2015 at 10:58 PM, Ashish Mukherjee
ashish.mukher...@gmail.com wrote:
I understand Spark can be used with Hadoop or standalone. I have certain
questions related to use of the correct FS for Spark data.
What is the efficiency trade-off in feeding data to Spark from NFS v
See following thread for 1.3.0 release:
http://search-hadoop.com/m/JW1q5hV8c4
Looks like the release is around the corner.
On Thu, Mar 5, 2015 at 3:26 PM, Cui Lin cui@hds.com wrote:
Hi, Ted,
Thanks for your reply. I noticed from the below link partitions.size
will not work for
Currently we have implemented External Data Source API and are able to
push filters and projections.
Could you provide some info on how the joins could perhaps be pushed down to
the original data source if both data sources are from the same database?
First a disclaimer: This is an
Hi,
You can first set up a Scala IDE, such as IntelliJ IDEA or Eclipse, to
develop and debug your Spark program.
Thanks,
Sun.
fightf...@163.com
From: Xi Shen
Date: 2015-03-06 09:19
To: user@spark.apache.org
Subject: Spark code development practice
Hi,
I am new to Spark. I see every
Hi ,
I used spark-ec2 script to create ec2 cluster.
Now I am trying copy data from s3 into hdfs.
I am doing this
[root@ip-172-31-21-160 ephemeral-hdfs]$ bin/hadoop distcp
s3://xxx/home/mydata/small.sam
hdfs://ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9010/data1
Thanks guys, this is very useful :)
@Stephen, I know spark-shell will create a SC for me. But I don't
understand why we still need to do new SparkContext(...) in our code.
Shouldn't we get it from somewhere, e.g. SparkContext.get?
Another question, if I want my spark code to run in YARN later,
It works pretty fine for me with the script that comes with the 1.2.0 release.
Here's a few things which you can try:
- Add your s3 credentials to the core-site.xml
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
- Do a
Why not setup HDFS?
Thanks
Best Regards
On Thu, Mar 5, 2015 at 4:03 PM, didmar marin.did...@gmail.com wrote:
Hi,
I'm having a problem involving file permissions on the local filesystem.
On a first machine, I have two different users :
- launcher, which launches my job from an uber jar
Hi All,
I am receiving data from AWS Kinesis using Spark Streaming and am writing
the data collected in the dstream to s3 using output function:
dstreamData.saveAsTextFiles("s3n://XXX:XXX@/")
After the run the application for several seconds, I end up with a sequence
of directories in S3 that
I've never tried it, but I'm pretty sure in the very least you want
-Pscala-2.11 (not -D).
On Thu, Mar 5, 2015 at 4:46 PM, Night Wolf nightwolf...@gmail.com wrote:
Hey guys,
Trying to build Spark 1.3 for Scala 2.11.
I'm running with the following Maven command;
-DskipTests -Dscala-2.11
Ah, and you may have to use dev/change-version-to-2.11.sh. (Again,
never tried compiling with scala 2.11.)
On Thu, Mar 5, 2015 at 4:52 PM, Marcelo Vanzin van...@cloudera.com wrote:
I've never tried it, but I'm pretty sure in the very least you want
-Pscala-2.11 (not -D).
On Thu, Mar 5, 2015
You can do what you want with LATERAL VIEW explode, but what seems to be
missing is that jsonRDD converts JSON objects into structs (fixed keys with a
fixed order), and fields in a struct are accessed using a `.`
val myJson =
  sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil))
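A rough sketch of the query side (this assumes sqlContext is a HiveContext, since LATERAL VIEW is HiveQL, and that the table is registered as "JsonTest"):
myJson.registerTempTable("JsonTest")
sqlContext.sql(
  "SELECT f.bar, f.baz FROM JsonTest LATERAL VIEW explode(foo) t AS f").collect()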
One other caveat: While writing up this example I realized that we make
SparkPlan private and we are already packaging 1.3-RC3... So you'll need a
custom build of Spark for this to run. We'll fix this in the next release.
On Thu, Mar 5, 2015 at 5:26 PM, Michael Armbrust mich...@databricks.com
Use the spark-shell command and the shell will open.
Type :paste and then paste your code, followed by Ctrl-D.
To open spark-shell:
spark/bin
./spark-shell
Sent from my iPhone
On 6 Mar 2015, at 02:28, fightf...@163.com fightf...@163.com wrote the
following:
Hi,
You can first
Is there any plans of supporting JSON arrays more fully? Take for example:
val myJson =
  sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil))
myJson.registerTempTable("JsonTest")
I would like a way to pull out parts of the array data based on a key:
sql("SELECT foo[bar] FROM JsonTest")
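For comparison, individual array elements can presumably be addressed by position with the struct representation, e.g. (untested sketch):
sql("SELECT foo[0].bar FROM JsonTest")
but I would like to select by key rather than by index.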
I am running Spark on a HortonWorks HDP cluster. I have deployed the
prebuilt version, but it is only for Spark 1.2.0, not 1.2.1, and there are a
few fixes and features in there that I would like to leverage.
I just downloaded the spark-1.2.1 source and built it to support Hadoop 2.6
by doing the
Thanks for the DirectOutputCommitter example.
However I found it only works for saveAsHadoopFile. What about
saveAsParquetFile?
It looks like SparkSQL is using ParquetOutputCommitter, which is a subclass
of FileOutputCommitter.
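For context, the kind of committer I mean is roughly the following (a sketch of a no-op committer for the old mapred API; the actual DirectOutputCommitter example I received may differ). It skips the temporary-directory rename step, so it is only safe with speculation disabled:
import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

class DirectOutputCommitter extends OutputCommitter {
  // Tasks write straight to the final location; nothing to set up, commit or clean up.
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}

// wired in for the old-API saveAsHadoopFile path, e.g.:
sc.hadoopConfiguration.set("mapred.output.committer.class",
  classOf[DirectOutputCommitter].getName)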
On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor
You may exclude the log4j dependency while building. You can have a look at
this build file to see how to exclude libraries
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html
Thanks
Best Regards
On Thu, Mar 5, 2015 at 1:20
Yes, unfortunately that direct dependency makes this injection much more
difficult for saveAsParquetFile.
On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote:
Thanks for the DirectOutputCommitter example.
However I found it only works for saveAsHadoopFile. What about
However, Executors were dying when using Netty as well, so it is possible
that the OOM was occurring then too. It is also possible only one of your
Executors OOMs (due to a particularly large task) and the others display
various exceptions while trying to fetch the shuffle blocks from the failed
When you use KafkaUtils.createStream with StringDecoders, it will return
String objects inside your messages stream. To access the elements from the
json, you could do something like the following:
val mapStream = messages.map(x => {
  val mapper = new ObjectMapper() with ScalaObjectMapper
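A fuller sketch of that mapping (note: the exact package of ScalaObjectMapper depends on your jackson-module-scala version):
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// messages is the (key, value) DStream returned by KafkaUtils.createStream
val parsed = messages.map { case (_, json) =>
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  mapper.readValue[Map[String, Any]](json)   // each message becomes a Scala Map
}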
We have two applications need to connect to Spark Sql thrift server.
The first application is developed in Java. With the Spark SQL thrift server
running, we followed the steps in this link
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC
and the
Thanks. I was about to submit a ticket for this :)
Also there's a ticket for sort-merge based groupbykey
https://issues.apache.org/jira/browse/SPARK-3461
BTW, any idea why running with Netty didn't output OOM error messages? It's
very confusing when troubleshooting.
Jianshi
On Thu, Mar 5, 2015 at
Hello,
I understand Spark can be used with Hadoop or standalone. I have certain
questions related to use of the correct FS for Spark data.
What is the efficiency trade-off in feeding data to Spark from NFS v HDFS?
If one is not using Hadoop, is it still usual to house data in HDFS for
Spark to
Jackson 1.9.13? and codehaus.jackson.version? that's already set by
the profile hadoop-2.4.
On Thu, Mar 5, 2015 at 6:13 PM, Ted Yu yuzhih...@gmail.com wrote:
Please add the following to build command:
-Djackson.version=1.9.3
Cheers
On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist
Hi,
I am using CDH5.3.2 now for a Spark project. I got the following exception:
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
I used all the CDH5.3.2 jar files in my pom file to generate the application
jar file.
I'm running into an issue building Spark v1.2.1 (as well as the latest in
branch-1.2 and v1.3.0-rc2 and the latest in branch-1.3) with BigTop (v0.9,
which is not quite released yet). The build fails in the External Flume Sink
subproject with the following error:
[INFO] Compiling 5 Scala
That particular class you did find is under parquet/... which means it was
shaded. Did you build your application against a hadoop2.6 dependency? The
maven central repo only has 2.2 but HDP has its own repos.
On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist tsind...@gmail.com wrote:
I am running
Please add the following to build command:
-Djackson.version=1.9.3
Cheers
On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist tsind...@gmail.com wrote:
I am running Spark on a HortonWorks HDP cluster. I have deployed the
prebuilt version but it is only for Spark 1.2.0, not 1.2.1, and there are a
few
@Victor,
I'm pretty sure I built it correctly, I specified -Dhadoop.version=2.6.0,
am I missing something here? Followed the docs on this but I'm open to
suggestions.
make-distribution.sh --name hadoop2.6 --tgz -Pyarn -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests
In Hadoop 1.x TaskAttemptContext is a class (for example,
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/mapred/TaskAttemptContext.html)
In Hadoop 2.x TaskAttemptContext is an interface
(https://hadoop.apache.org/docs/r2.4.0/api/org/apache/hadoop/mapreduce/TaskAttemptContext.html)
An RDD is a Resilient *Distributed* Data set. The partitioning and
distribution of the data happens in the background. You'll occasionally
need to concern yourself with it (especially to get good performance), but
from an API perspective it's mostly invisible (some methods do allow you to
specify
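For example, a few of the places where the number of partitions can be set explicitly (a sketch):
val lines  = sc.textFile("hdfs:///data/input", 64)   // ask for at least 64 partitions
val counts = lines.flatMap(_.split(" ")).map((_, 1))
  .reduceByKey(_ + _, 32)                            // shuffle into 32 partitions
val fewer  = counts.repartition(16)                  // explicit reshuffle into 16 partitions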
Hi Cui,
What version of Spark are you using? There was a bug ticket that may be related
to this, fixed in core/src/main/scala/org/apache/spark/rdd/RDD.scala that is
merged into versions 1.3.0 and 1.2.1 . If you are using 1.1.1 that may be the
reason but it’s a stretch
Hi,
My HDFS and YARN services are started, and my spark-shell works in local
mode.
But when I try spark-shell --master yarn-client, a job can be created at
the YARN service, but will fail very soon. The Diagnostics are:
Application application_1425559747310_0002 failed 2 times due to AM
In the console, you'd find this draws a progress bar illustrating the
current stage progress. In logs, it shows up as this sort of 'pyramid'
since CR makes a newline.
You can turn it off with spark.ui.showConsoleProgress = false
On Thu, Mar 5, 2015 at 2:11 AM, cjwang c...@cjwang.us wrote:
When
Hi,
I'm having a problem involving file permissions on the local filesystem.
On a first machine, I have two different users :
- launcher, which launches my job from an uber jar file
- spark, which runs the master
On a second machine, I have a user spark (same uid/gid as the other) which
runs the
Hello Julaiti,
Maybe I am just asking the obvious :-) but did you check disk IO? Depending
on what you are doing that could be the bottleneck.
In my case none of the HW resources was a bottleneck; rather, it was the use
of some distributed features that were blocking execution (e.g. Hazelcast).
Could that be
Hi All
I have a simple spark application, where I am trying to broadcast a String
type variable on YARN Cluster.
But every time I try to access the broadcast variable's value, I get null
within the task. It would be really helpful if you could suggest what I am
doing wrong.
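A simplified sketch of the pattern I am trying to follow (names are illustrative, not my actual code):
val bcVar = sc.broadcast("some config string")   // created once on the driver
rdd.map { x =>
  val shared = bcVar.value                       // read inside the task via .value
  x + "-" + shared
}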
Cui:
You can check messages.partitions.size to determine whether messages is an
empty RDD.
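For example, inside the streaming job that check would look roughly like this (a sketch, with messages as the DStream from createStream):
messages.foreachRDD { rdd =>
  if (rdd.partitions.size > 0) {
    // process the non-empty batch here
  }
}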
Cheers
On Thu, Mar 5, 2015 at 12:52 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
When you use KafkaUtils.createStream with StringDecoders, it will return
String objects inside your messages stream.
I am trying to use Apache Spark to load a file, distribute it to several
nodes in my cluster, and then aggregate the results and collect them.
I don't quite understand how to do this.
From my understanding the reduce action enables Spark to combine the results
from different nodes and
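A minimal sketch of what I think that pattern looks like (assuming a text file on HDFS):
val lines = sc.textFile("hdfs:///path/to/file")   // lines are split into partitions across the nodes
val sizes = lines.map(_.length)                   // computed on the nodes holding each partition
val total = sizes.reduce(_ + _)                   // partial results combined and returned to the driver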
There is a map function in Clojure, so you can map one collection to
another. The most similar operation here is each; however, when f is applied
to an input tuple we get a tuple with two fields: f([field-a]) =
[field-a field-b].
How could I achieve the same operation on a Trident stream?
If you do not want those progress indication to appear, just set
spark.ui.showConsoleProgress to false, e.g:
System.setProperty("spark.ui.showConsoleProgress", "false");
Regards
Can you query via Hive directly? Let's first confirm whether it's a bug in
SparkSQL or in your PHP code.
-Original Message-
From: fanooos [mailto:dev.fano...@gmail.com]
Sent: Thursday, March 5, 2015 4:57 PM
To: user@spark.apache.org
Subject: Connection PHP application to Spark Sql thrift server
We
Great point :) Cui, here's a cleaner way than I had before, without the use
of Spark SQL for the mapping:
KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map("github" -> 5), StorageLevel.MEMORY_ONLY)
  .map { case (k, v) =>
Hi Wush,
I'm CC'ing user@spark.apache.org (which is the new list) and BCC'ing
u...@spark.incubator.apache.org.
In Spark 1.3, schemaRDD is in fact being renamed to DataFrame (see:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
)
As for a
Hi,
I am new to Spark. I see every spark program has a main() function. I
wonder if I can run the spark program directly, without using spark-submit.
I think it will be easier for early development and debug.
Thanks,
David
Hi Xi,
Yes,
You can do the following:
val sc = new SparkContext("local[2]", "mptest")
// or .. val sc = new SparkContext("spark://master:7070", "mptest")
val fileDataRdd = sc.textFile("/path/to/dir")
val fileLines = fileDataRdd.take(100)
The key here - i.e. the answer to your specific question -
I found that my tasks take a very long time in young-gen GC. I set the young
gen size to about 1.5G; I wonder why it takes so long?
Not all the tasks take such a long time, only about 1% of them...
180.426: [GC [PSYoungGen: 9916105K->1676785K(14256640K)]
26201020K->18690057K(53403648K), 17.3581500
Ok, I solved this problem by:
- changing the primary group of launcher to spark
- adding umask 002 in launcher's .bashrc and spark's init.d script
Hi all,
I'm trying to write a Spark Streaming program, so I read the
Hey,
Trying to build latest spark 1.3 with Maven using
-DskipTests clean install package
But I'm getting errors with zinc, in the logs I see;
[INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
spark-network-common_2.11 ---
...
[error] Required file not found:
try it with mvn -DskipTests -Pscala-2.11 clean install package
We don't support warm starts or online updates for decision trees. So
if you call train twice, only the second dataset is used for training.
-Xiangrui
On Thu, Mar 5, 2015 at 12:31 PM, drarse drarse.a...@gmail.com wrote:
I am testing the Random Forest in Spark, but I have a question... If I train
Dear all,
I am a new spark user from R.
After exploring the SchemaRDD, I notice that it is similar to a data.frame.
Is there a feature like `model.matrix` in R to convert a SchemaRDD to a model
matrix automatically according to the column types, without explicitly
converting them one by one?
Thanks,
Wush