I tried
val pairVarOriRDD = sc.newAPIHadoopFile(path,
  classOf[NetCDFFileInputFormat].asSubclass(
    classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[WRFIndex, WRFVariable]]),
  classOf[WRFIndex],
  classOf[WRFVariable],
  jobConf)
The compiler does not
I am not sure if your issue is setting the FAIR mode correctly or something
else, so let's start with the FAIR mode.
Do you see the scheduler mode actually being set to FAIR?
I have this line in spark-defaults.conf
spark.scheduler.allocation.file=/spark/conf/fairscheduler.xml
Then, when I start my
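(For reference, a minimal fairscheduler.xml that such a setting points at might look like the following; the pool name and values here are illustrative, and spark.scheduler.mode must also be set to FAIR:)

<?xml version="1.0"?>
<allocations>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>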
Hi all,
I am trying to run a Spark Java application on EMR, but I keep getting
NullPointerException from the Application Master (Spark version on
EMR: 1.2). The stacktrace is below. I also tried to run the
application on Hortonworks Sandbox (2.2) with spark 1.2, following the
blogpost
What I think is happening is that the map operations are executed concurrently,
and the map operation in rdd2 has the initial copy of myObjectBroadcated.
Is there a way to apply the transformations sequentially? First materialize
rdd1 and then rdd2 (see the sketch below).
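(A minimal sketch of one way to force that ordering, assuming the first RDD can be cached; the data and transformations are illustrative:)

val rdd1 = sc.parallelize(1 to 100).map(_ + 1).cache()
rdd1.count()                 // an action: materializes rdd1 now
val rdd2 = rdd1.map(_ * 2)   // by the time rdd2 runs, rdd1 is already computed
rdd2.count()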
Thanks a lot!
On 24 February 2015 at 18:49,
My application runs fine for ~3-4 hours and then hits this issue.
On Wed, Feb 25, 2015 at 11:34 AM, Mukesh Jha me.mukesh@gmail.com
wrote:
Hi Experts,
My Spark job is failing with the below error.
From the logs I can see that input-3-1424842351600 was added at 5:32:32
and was never purged
Hello!
I've read the documentation about the Spark architecture, and I have the
following questions:
1: How many executors can be on a single worker process (JVM)?
2: Should I think of an executor like a Java thread pool executor, where the
pool size is equal to the number of the given cores (set up by the
Let me ask like this: what would be the easiest way to display the
throughput in the web console? Would I need to create a new tab and add the
metrics? Any good or simple examples showing how this can be done?
On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Did you
For SparkStreaming applications, there is already a tab called Streaming
which displays the basic statistics.
Thanks
Best Regards
On Wed, Feb 25, 2015 at 8:55 PM, Josh J joshjd...@gmail.com wrote:
Let me ask like this: what would be the easiest way to display the
throughput in the web
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
For SparkStreaming applications, there is already a tab called Streaming
which displays the basic statistics.
Would I just need to extend this tab to add the throughput?
I have Spark running in standalone mode with 4 executors, each executor
with 5 cores (spark.executor.cores=5). However, when I'm processing
an RDD with ~90,000 partitions, I only get 4 parallel tasks. Shouldn't I
be getting 4x5=20 parallel task executions?
Hi Yiannis,
Broadcast variables are meant for *immutable* data. They are not meant for
data structures that you intend to update. (It might *happen* to work when
running in local mode, though I doubt it, and it would probably be a bug if it
did. It will certainly not work when running on a
Hello,
I am preparing some tests to execute in Spark in order to manipulate
properties and check the variations in results.
For this, I need to use a standard application in my environment, like the
well-known Hadoop apps: TeraSort
By throughput, do you mean the number of events processed, etc.?
The Streaming tab already has these statistics.
Thanks
Best Regards
On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote:
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I have been running into NegativeArraySizeExceptions when doing joins on
data with very skewed key distributions in Spark 1.2.0. I found a previous
post that mentioned that this exception arises when the size of the blocks
spilled during the shuffle exceeds 2GB. The post recommended increasing
Is the threshold valid only for tall skinny matrices? Mine is 6M x 1.5M,
and I made the sparsity pattern 100:1.5M. We would like to increase the
sparsity pattern to 1000:1.5M.
I am running 1.1 stable and I get random shuffle failures... maybe the 1.2
sort shuffle will help.
I read in Reza's paper that
Hi all
Here is another Spark talk (a vendor-independent one!) that you might have
missed:
'The Future of Apache Hadoop' track: How Spark and Flink are shaping the future
of Hadoop?
https://hadoopsummit.uservoice.com/forums/283266-the-future-of-apache-hadoop/suggestions/7074410
Regards,
Slim
Cheng, We tried this setting and it still did not help. This was on Spark
1.2.0.
--
Kannan
On Mon, Feb 23, 2015 at 6:38 PM, Cheng Lian lian.cs@gmail.com wrote:
(Move to user list.)
Hi Kannan,
You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is this
line of code
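(For reference, setting that property in hive-site.xml uses the standard Hadoop configuration syntax, roughly like this:)

<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
</property>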
Made 3 votes for each of the talks. Looking forward to seeing them at the
Hadoop Summit :) -Xiangrui
On Tue, Feb 24, 2015 at 9:54 PM, Reynold Xin r...@databricks.com wrote:
Hi all,
The Hadoop Summit uses community choice voting to decide which talks to
feature. It would be great if the community could
Please add the Zagreb Meetup group, too.
http://www.meetup.com/Apache-Spark-Zagreb-Meetup/
Thanks!
On 18.2.2015. 19:46, Johan Beisser wrote:
If you could also add the Hamburg Apache Spark Meetup, I'd appreciate it.
http://www.meetup.com/Hamburg-Apache-Spark-Meetup/
On Tue, Feb 17, 2015 at
It looks like that is getting interpreted as a local path. Are you missing
a core-site.xml file to configure HDFS?
On Tue, Feb 24, 2015 at 10:40 PM, kundan kumar iitr.kun...@gmail.com
wrote:
Hi Denny,
Yes, the user has all the rights to HDFS. I am running all the Spark
operations with this
Hi Josh,
SPM will show you this info. I see you use Kafka, too, whose numerous metrics
you can also see in SPM side by side with your Spark metrics. Sounds like
trends are what you are after, so I hope this helps. See http://sematext.com/spm
Otis
On Feb 24, 2015, at 11:59, Josh J
Interesting. Looking at SparkConf.scala:
val configs = Seq(
  DeprecatedConfig("spark.files.userClassPathFirst",
    "spark.executor.userClassPathFirst",
    "1.3"),
  DeprecatedConfig("spark.yarn.user.classpath.first", null, "1.3",
    "Use spark.{driver,executor}.userClassPathFirst
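(For context, these properties are typically passed at submit time; a sketch assuming the 1.3-style names from the code above, with placeholder class and jar names:)

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.MyApp myapp.jar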
Hello Imran,
Thanks for your response. I noticed the intersection and subtract
methods for an RDD; do they work based on a hash of all the fields in an RDD
record?
- Himanish
On Thu, Feb 19, 2015 at 6:11 PM, Imran Rashid iras...@cloudera.com wrote:
the more scalable alternative is to do a
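(For what it's worth, subtract and intersection compare entire records using their equals/hashCode, not just a key field; a small illustrative sketch:)

val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
val b = sc.parallelize(Seq(("k1", 1)))
a.subtract(b).collect()   // Array((k2,2)): whole-record equality decides membership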
Look at the trace again. It is a very weird error. The SparkSubmit is running
on the client side, but YarnClusterSchedulerBackend is supposed to run in the
YARN AM.
I suspect you are running the cluster with yarn-client mode, but in
JavaSparkContext you set "yarn-cluster". As a result, spark
Hello,
I am trying to pass an org.apache.spark.rdd.RDD table as a parameter to a
recursive function. This table should be changed at each step of the
recursion, and it cannot just be a global var.
Need help :)
Thank you
I have this function in the driver program which collects the results from
RDDs (in a stream) into an array and returns it. However, even though the RDDs
(in the DStream) have data, the function is returning an empty array... What
am I doing wrong?
I can print the RDD values inside the foreachRDD call
Using HiveContext solved it.
Not sure if this is the place to ask, but I am using the terasort branch of
Spark for benchmarking, as found on
https://github.com/ehiggs/spark/tree/terasort, and I get the error below
when running on two machines (one machine works just fine). When looking at
the code, listed below the error
Hello,
Does Spark standalone support running multiple executors on one worker node?
It seems YARN has the parameter --num-executors to set the number of executors
to deploy, but I cannot find the equivalent parameter in Spark standalone.
Thanks,
Judy
Hi,
On Thu, Feb 26, 2015 at 11:24 AM, Thanigai Vellore
thanigai.vell...@gmail.com wrote:
It appears that the function immediately returns even before the
foreachRDD stage is executed. Is that possible?
Sure, that's exactly what happens. foreachRDD() schedules a computation, it
does not
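(A sketch of the pattern in question, with illustrative names; the buffer is returned before any batch has run, and the appends only happen per batch once the streaming context is started:)

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.streaming.dstream.DStream

def collectValues(stream: DStream[String]): ArrayBuffer[String] = {
  val values = ArrayBuffer[String]()
  stream.foreachRDD { rdd => values ++= rdd.collect() }  // only registers work
  values  // returned immediately: still empty at this point
}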
I didn't include the complete driver code, but I do run the streaming
context from the main program which calls this function. Again, I can print
the RDD elements within the foreachRDD block, but the array that is returned
is always empty. It appears that the function immediately returns even
before
Hi Reza,
With 40 nodes and shuffle space managed by YARN over the HDFS usercache, we
could run the similarity job without doing any thresholding... We used hash
based shuffle, and sort hopefully will further improve it... Note that this
job was almost 6M x 1.5M.
We will go towards 50M x ~3M columns and
Hi, Sparkers:
I come from the Hadoop MapReduce world, and am trying to understand some
internal details of Spark. From the web and this list, I keep seeing people
talking about increasing the parallelism if you get an OOM error. I tried to
read the documentation as much as possible to understand the RDD
You can easily add a function (say setup_pig) inside the function
setup_cluster in this script
https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L649
Thanks
Best Regards
On Thu, Feb 26, 2015 at 7:08 AM, Sameer Tilak ssti...@live.com wrote:
Hi,
I was looking at the documentation
Spark and Hadoop should be listed as 'provided' dependencies in your
Maven or SBT build. But that should make them available at compile time.
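(In SBT, for instance, along these lines; the version is illustrative:)

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"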
On Wed, Feb 25, 2015 at 10:42 PM, boci boci.b...@gmail.com wrote:
Hi,
I have a little question. I want to develop a Spark-based application, but
Spark
Getting an error that confuses me. Running a largish app on a standalone
cluster on my laptop. The app uses a Guava HashBiMap as a broadcast value. With
Spark 1.1.0 I simply registered the class and its serializer with Kryo like
this:
Thanks dude... I think I will pull up a Docker container for integration
testing
--
Skype: boci13, Hangout: boci.b...@gmail.com
On Thu, Feb 26, 2015 at 12:22 AM, Sean Owen
Could this be caused by Spark using a shaded Guava jar?
Cheers
On Wed, Feb 25, 2015 at 3:26 PM, Pat Ferrel p...@occamsmachete.com wrote:
Getting an error that confuses me. Running a largish app on a standalone
cluster on my laptop. The app uses a guava HashBiMap as a broadcast value.
With
How many reducers did you set for Hive? With a small data set, Hive will run in
local mode, which always sets the reducer count to 1.
From: Kannan Rajah [mailto:kra...@maprtech.com]
Sent: Thursday, February 26, 2015 3:02 AM
To: Cheng Lian
Cc: user@spark.apache.org
Subject: Re: Spark-SQL 1.2.0 sort
Yes, been on the books for a while ...
https://issues.apache.org/jira/browse/SPARK-2356
That one just may always be a known 'gotcha' in Windows; it's kind of
a Hadoop gotcha. I don't know that Spark 100% works on Windows and it
isn't tested on Windows.
On Wed, Feb 25, 2015 at 11:05 PM, boci
I have an application which might benefit from Spark's
distribution/analysis, but I'm worried about the size and structure of
my data set. I need to perform several thousand simulations on a rather
large data set, and I need access to all the generated simulations. The
data element is largely
I'm using Spark 1.2, standalone cluster on EC2. I have a cluster of 8
r3.8xlarge machines but limit the job to only 128 cores. I have also tried
other things, such as setting 4 workers per r3.8xlarge with 67GB each, but this
made no difference.
The job frequently fails at the end in this step
Hi,
I was looking at the documentation for deploying Spark cluster on EC2.
http://spark.apache.org/docs/latest/ec2-scripts.html
We are using Pig to build the data pipeline and then use MLlib for analytics. I
was wondering if someone has any experience including additional
tools/services
Yes. # tuples processed in a batch = sum of all the tuples received by all
the receivers.
In the screenshot, there was a batch with 69.9K records, and there was a batch
which took 1 s 473 ms. These two batches can be the same or different
batches.
TD
On Wed, Feb 25, 2015 at 10:11 AM, Josh J
I'm getting this really reliably on Spark 1.2.1. Basically I'm in local
mode with parallelism at 8. I have 222 tasks and I never seem to get far
past 40. Usually in the 20s to 30s it will just hang. The last logging is
below, and a screenshot of the UI.
2015-02-25 20:39:55.779 GMT-0800 INFO
Forwarding conversation below that didn't make it to the list.
-- Forwarded message --
From: Jim Kleckner j...@cloudphysics.com
Date: Wed, Feb 25, 2015 at 8:42 PM
Subject: Re: Spark excludes fastutil dependencies we need
To: Ted Yu yuzhih...@gmail.com
Cc: Sean Owen
I am now getting the following error. I cross-checked my types and
corrected three of them, i.e. r26 -> String, r27 -> Timestamp,
r28 -> Timestamp. This error still persists.
scala> sc.textFile("/home/cdhuser/Desktop/Sdp_d.csv").map(_.split(",")).map
{ r =>
| val upto_time = sdf.parse(r(23).trim);
|
I get the same exception simply by doing a large broadcast of about 6GB.
Note that I'm broadcasting a small number (~3M) of fat objects. There's
plenty of free RAM. This and related Kryo exceptions seem to crop up
whenever an object graph of more than a couple of GB gets passed around.
at
Which version of Spark are you on? It seems there was a similar JIRA
Thanks
Best Regards
On Thu, Feb 26, 2015 at 12:03 PM, tridib tridib.sama...@live.com wrote:
Hi,
I need to find the top 10 best-selling samples. So the query looks like:
select
Actually, I just realized I am using 1.2.0.
Thanks
Tridib
Date: Thu, 26 Feb 2015 12:37:06 +0530
Subject: Re: group by order by fails
From: ak...@sigmoidanalytics.com
To: tridib.sama...@live.com
CC: user@spark.apache.org
Which version of Spark are you on? It seems there was a similar JIRA
I created an issue and pull request.
Discussion can continue there:
https://issues.apache.org/jira/browse/SPARK-6029
Hi,
I need to find the top 10 best-selling samples. So the query looks like:
select s.name, count(s.name) from sample s group by s.name order by
count(s.name)
This query fails with the following error:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [COUNT(name#0) ASC], true
What operation are you trying to do, and how big is the data that you are
operating on?
Here are a few things you can try (a combined sketch follows the list):
- Repartition the RDD to a number higher than 222
- Specify the master as local[*] or local[10]
- Use the Kryo serializer (.set("spark.serializer",
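(A sketch of those settings together; the repartition count and master string are illustrative:)

val conf = new org.apache.spark.SparkConf()
  .setMaster("local[*]")  // or local[10]
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new org.apache.spark.SparkContext(conf)
// then, on the RDD in question:
// rdd.repartition(500)   // higher than the current 222 partitions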
Inline
On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu yuzhih...@gmail.com wrote:
Interesting. Looking at SparkConf.scala:
val configs = Seq(
  DeprecatedConfig("spark.files.userClassPathFirst",
    "spark.executor.userClassPathFirst",
    "1.3"),
Can you try increasing ulimit -n on your machine?
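(For example, in the shell that launches the worker processes; the value is illustrative:)

ulimit -n 65536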
On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier mps@gmail.com wrote:
Hi Sameer,
I’m still using Spark 1.1.1, I think the default is hash shuffle. No
external shuffle service.
We are processing gzipped JSON files, the partitions are
Maybe drop the exclusion for the parquet-provided profile?
Cheers
On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner j...@cloudphysics.com wrote:
Inline
On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu yuzhih...@gmail.com wrote:
Interesting. Looking at SparkConf.scala:
val configs = Seq(
Spark Streaming has a new Kafka direct stream, to be released as an
experimental feature with 1.3. That uses a low-level consumer. Not sure if
it satisfies your purpose.
If you want more control, it's best to create your own Receiver with the
low-level Kafka API.
TD
On Tue, Feb 24, 2015 at 12:09 AM,
Did you try setting .set("spark.cores.max", "20")?
Thanks
Best Regards
On Wed, Feb 25, 2015 at 10:21 PM, Akshat Aranya aara...@gmail.com wrote:
I have Spark running in standalone mode with 4 executors, and each
executor with 5 cores each (spark.executor.cores=5). However, when I'm
processing an
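(The suggestion as a sketch; spark.cores.max caps the total cores a standalone-mode application takes across the cluster:)

val conf = new org.apache.spark.SparkConf()
  .set("spark.cores.max", "20")   // allow up to 4 executors x 5 cores
val sc = new org.apache.spark.SparkContext(conf)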
Hello Spark experts,
I have tried reading the Spark documentation and have searched many posts on
this forum, but I couldn't find a satisfactory answer to my question. I have
recently started using Spark, so I may be missing something, and that's why I
am looking for your guidance here.
I have a
Hi,
I want to set up a Spark cluster with a YARN dependency on Amazon EC2. I was
reading this https://spark.apache.org/docs/1.2.0/running-on-yarn.html
document, and I understand that Hadoop has to be set up for running Spark with
YARN. My questions -
1. Do we have to set up a Hadoop cluster on EC2
Hi,
just a quick question about calling persist with the _2 option. Is the 2x
replication only useful for fault tolerance, or will it also increase job speed
by avoiding network transfers? Assuming I’m doing joins or other shuffle
operations.
Thanks
If you mean, can both copies of the blocks be used for computations?
Yes, they can.
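(For reference, the _2 storage levels being discussed; rdd is assumed to be in scope:)

import org.apache.spark.storage.StorageLevel
rdd.persist(StorageLevel.MEMORY_ONLY_2)  // two in-memory replicas; tasks can read either copy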
On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier mps@gmail.com wrote:
Hi,
just a quick question about calling persist with the _2 option. Is the 2x
replication only useful for fault tolerance, or will it
Yes. Effectively, could it avoid network transfers? Or put differently, would
an option like persist(MEMORY_ALL) improve job speed by caching an RDD on every
worker?
On 25.02.2015, at 11:42, Sean Owen so...@cloudera.com wrote:
If you mean, can both copies of the blocks be used for
We're using the capacity scheduler, to the best of my knowledge. Unsure if
multi-resource scheduling is used, but if you know of an easy way to figure
that out, then let me know.
Thanks,
Anders
On Sat, Feb 21, 2015 at 12:05 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:
Are you using the
This is the declaration of my custom input format:
public class NetCDFFileInputFormat extends ArrayBasedFileInputFormat
public abstract class ArrayBasedFileInputFormat extends
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
Best,
Patcharee
On 25. feb. 2015 10:15, patcharee wrote:
Hi,
OK, from the declaration you sent me separately:
public class NetCDFFileInputFormat extends ArrayBasedFileInputFormat
public abstract class ArrayBasedFileInputFormat extends
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
It looks like you do not declare any generic types that
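(Presumably the missing piece is the generic key/value parameters on the superclass; a sketch of the declarations above with the types carried through, not a tested fix:)

public abstract class ArrayBasedFileInputFormat extends
    org.apache.hadoop.mapreduce.lib.input.FileInputFormat<WRFIndex, WRFVariable>
public class NetCDFFileInputFormat extends ArrayBasedFileInputFormat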
The link has proved helpful. I have been able to load data, register it as
a table, and perform simple queries. Thanks Akhil!!
Though, I still look forward to knowing where I was going wrong with my
previous technique of extending the Product interface to overcome the case
class limit of 22 fields.
It says sdp_d not found; since it is a class, you need to instantiate it
once, like:
sc.textFile("derby.log").map(_.split(",")).map( r => {
  val upto_time = sdf.parse(r(23).trim);
  calendar.setTime(upto_time);
  val r23 = new java.sql.Timestamp(upto_time.getTime);
Hi Yana,
I tried running the program after setting the property
spark.scheduler.mode to FAIR. But the result is the same as before. Are
there any other properties that have to be set?
On Tue, Feb 24, 2015 at 10:26 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
It's hard to tell. I have not
Hi,
I am new to Spark and Scala. I have a custom input format (used before
with MapReduce) and I am trying to use it in Spark.
In the Java API (the syntax is correct):
JavaPairRDD<WRFIndex, WRFVariable> pairVarOriRDD = sc.newAPIHadoopFile(
    path,
    NetCDFFileInputFormat.class,
Did you have a look at
https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener
And for Streaming:
https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
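(As a sketch of the listener approach for the throughput question, with ssc an existing StreamingContext; what to log per batch is up to you:)

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println("batch " + info.batchTime + ": processing delay " +
      info.processingDelay.getOrElse(-1L) + " ms")
  }
})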
Thanks
Best Regards
On Tue, Feb 24,
Thanks Akhil, it was a simple fix that you suggested... I missed it. ☺
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, February 25, 2015 12:48 PM
To: Somnath Pandeya
Cc: user@spark.apache.org
Subject: Re: used cores are less then total no. of core
You can set the following in
These settings don't control what happens to stderr, right? stderr is
up to the process that invoked the driver to control. You may wish to
configure log4j to log to files instead.
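(A sketch of a log4j.properties along those lines; the file path and pattern are illustrative:)

log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/var/log/spark-driver.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n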
On Wed, Nov 12, 2014 at 8:15 PM, Nguyen, Duc duc.ngu...@pearson.com wrote:
I've also tried setting the
No, we should not add fastutil back. It's up to the app to bring
dependencies it needs, and that's how I understand this issue. The
question is really, how to get the classloader visibility right. It
depends on where you need these classes. Have you looked into
spark.files.userClassPathFirst and
I believe your class needs to be defined as a case class (as I answered
on SO).
On 25.2.2015. 5:15, anamika gupta wrote:
Hi Akhil
I guess it skipped my attention. I would definitely give it a try.
Meanwhile, I would still like to know what the issue is with the way I have
created the schema.
Hi,
Is there a good (recommended) way to control and run multiple Spark jobs
within the same application? My application is as follows:
1) Run one Spark job on a 'full' dataset, which then creates a few thousand
RDDs containing sub-datasets from the complete dataset. Each of the