Is there any built-in function for calculating percentiles over a dataset?
I want to calculate the percentiles for each column in my data.
Regards,
Kundan
When I looked at this last fall, the only way that seemed to be available was
to transform my data into SchemaRDDs, register them as tables and then use the
Hive processor to calculate them with its built in percentile UDFs that were
added in 1.2.
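For what it's worth, a minimal sketch of that approach (assuming a Hive-enabled Spark 1.2 build, e.g. in spark-shell where `sc` exists; the schema and table name are illustrative):

```scala
import org.apache.spark.sql.hive.HiveContext

// Illustrative schema; assumes an existing SparkContext `sc` (e.g. spark-shell).
case class Record(value: Double)

val hiveContext = new HiveContext(sc)
import hiveContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD

val data = sc.parallelize(Seq(Record(1.0), Record(2.0), Record(3.0), Record(4.0)))
data.registerTempTable("my_data")

// Hive's built-in UDF: 0.5 gives the median, 0.95 the 95th percentile.
hiveContext.sql(
  "SELECT percentile_approx(value, 0.5), percentile_approx(value, 0.95) FROM my_data")
  .collect().foreach(println)
```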
Curt
From:
You shouldn’t have any issues with differing nodes on the latest Ambari and
Hortonworks. It works fine for mixed hardware and spark on yarn.
Simon
On Jan 26, 2015, at 4:34 PM, Michael Segel msegel_had...@hotmail.com wrote:
If you’re running YARN, then you should be able to mix and match
This is what I get:
./bigcontent-1.0-SNAPSHOT.jar:org/apache/http/impl/conn/SchemeRegistryFactory.class
(probably because I'm using a self-contained JAR).
In other words, I'm still stuck.
--
Emre
On Wed, Jan 28, 2015 at 2:47 PM, Charles Feduke charles.fed...@gmail.com
wrote:
I deal with
Processing one object isn't a distributed operation, and doesn't
really involve Spark. Just invoke your function on your object in the
driver; there's no magic at all to that.
You can make an RDD of one object and invoke a distributed Spark
operation on it, but assuming you mean you have it on
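As a sketch of both options (the function and input here are placeholders):

```scala
// Hypothetical function and input, just to illustrate the two paths.
def f(doc: String): Int = doc.length
val doc = "a single input object"

// Option 1: no Spark involved -- call the function directly in the driver.
val localResult = f(doc)

// Option 2: wrap the object in a one-element RDD so the same code path
// serves both the big-data run and the occasional single-input run
// (assumes an existing SparkContext `sc`).
val clusterResult = sc.parallelize(Seq(doc)).map(f).first()
```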
I deal with problems like this so often across Java applications with large
dependency trees. Add the shell function at the following link to your
shell on the machine where your Spark Streaming is installed:
https://gist.github.com/cfeduke/fe63b12ab07f87e76b38
Then run in the directory where
Thanks!
So I assume I can safely run a function *F* of mine within the spark driver
program, without dispatching it to the cluster (?), thereby sticking to one
piece of code for *both* a real cluster run over big data, and for small
on-demand runs for a single input (now and then), both scenarios
I'm looking @ the ShuffledRDD code and it looks like there is a method
setKeyOrdering()- is this guaranteed to order everything in the partition?
I'm on Spark 1.2.0
On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet cjno...@gmail.com wrote:
In all of the solutions I've found thus far, sorting has been
We found the problem and already fixed it. Basically, spark-ec2 requires ec2
instances to have external IP addresses. You need to specify this in the AWS
console.
From: nicholas.cham...@gmail.com
Date: Tue, 27 Jan 2015 17:19:21 +
Subject: Re: spark 1.2 ec2 launch script hang
To:
Hi,
I just wonder if there is any way to unregister/re-register a TempTable in
Spark?
best,
/Shahab
Hi,
My apologies for what has ended up as quite a long email with a lot of
open-ended questions, but, as you can see, I'm really struggling to get
started and would appreciate some guidance from people with more
experience. I'm new to Spark and big data in general, and I'm struggling
with what I
Normally, if this were all in one app, Maven would have solved the
problem for you by choosing 1.8 over 1.6. You do not need to exclude
anything; Maven does it for you.
Here the problem is that 1.8 is in the app but the server (Spark) uses
1.6. This is what the userClassPathFirst setting is for,
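A sketch of turning that on (note the property name moved around between releases, so check the docs for your version; in 1.2 the executor-side flag is the experimental spark.files.userClassPathFirst):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: gives user jars precedence over Spark's own copies on executors.
// Equivalent to passing --conf spark.files.userClassPathFirst=true to spark-submit.
val conf = new SparkConf()
  .setAppName("user-classpath-first")
  .set("spark.files.userClassPathFirst", "true")
val sc = new SparkContext(conf)
```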
Hi
We have a maven project which supports running of spark jobs and pig jobs.
But I could use only one of the elasticsearch-hadoop or
elasticsearch-spark jars at a time. If I use both jars together, I get a
conflict in org.elasticsearch.hadoop.cfg.SettingsManager, which is present as a
class in
Upon completion of the 2 hour part of the run, the files did not exist in
the output directory? One thing that is done serially is deleting any
remaining files from _temporary, so perhaps there was a lot of data
remaining in _temporary but the committed data had already been moved.
I am,
I've created a spark app, which runs fine if I copy the corresponding
jar to the hadoop-server (where yarn is running) and submit it there.
If it try it to submit it from my local machine, I get the error which
I've attached below.
Submit cmd: spark-submit.cmd --class
TL;DR: Extend FileOutputCommitter to eliminate the _temporary storage. There
are some implementations to be found online, typically called
DirectOutputCommitter, f.i. this spark pull request
https://github.com/themodernlife/spark/commit/4359664b1d557d55b0579023df809542386d5b8c.
Tell Spark to use your
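A sketch of wiring such a committer in (the class name below is hypothetical, and this property only affects the old mapred-API save paths such as saveAsHadoopFile / saveAsTextFile):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: assumes a DirectOutputCommitter implementation is on the classpath.
val conf = new SparkConf()
  .set("spark.hadoop.mapred.output.committer.class",
       "com.example.DirectOutputCommitter") // hypothetical fully-qualified name
val sc = new SparkContext(conf)
// Subsequent old-API saves on this context should then write directly to the
// output location instead of going through _temporary first.
```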
hi, thanks for the quick answer -- I suppose this is possible, though I
don't understand how it could come about. The largest individual RDD
elements are ~ 1 Mb in size (most are smaller) and the RDD is composed of
800k of them. The file is saved in 134 parts, but is being read in using
some 1916+
A search shows several historical threads for similar Kryo issues, but none
seem to have a definitive solution. Currently using Spark 1.2.0.
While collecting/broadcasting/grouping moderately sized data sets (~500MB -
1GB), I regularly see exceptions such as the one below.
I’ve tried increasing
Hi,
How would I run a given function in Spark, over a single input object?
Would I first add the input to the file system, then somehow invoke the
Spark function on just that input? or should I rather twist the Spark
streaming api for it?
Assume I'd like to run a piece of computation that
That indicates that you are using two different versions of es-hadoop (2.0.x)
and es-spark (2.1.x)
Have you considered aligning the two versions?
On 1/28/15 11:08 AM, aarthi wrote:
Hi
We have a maven project which supports running of spark jobs and pig jobs.
But I could use only one of the
By typo I meant that the column name had a spelling error:
conversion_aciton_id.
It should have been conversion_action_id.
No, we tried it a few times, and we didn't have + signs or anything like
that - we tried it with columns of different types too - string, double etc
and saw the same error.
I've switched to maven and all issues are gone, now.
2015-01-23 12:07 GMT+01:00 Sean Owen so...@cloudera.com:
Use mvn dependency:tree or sbt dependency-tree to print all of the
dependencies. You are probably bringing in more servlet API libs from
some other source?
On Fri, Jan 23, 2015 at
Hi Danny,
What you describe sounds like you might also want to consider Spring XD instead,
at least for the file-centric stuff.
Regards
Ben
Sent from my iPad
On 28.01.2015 at 10:42, Danny Yates da...@codeaholics.org wrote:
Hi,
My apologies for what has ended up as quite a long
It's because you submitted the job from Windows to a Hadoop cluster running
on Linux. Spark does not support that yet. See
https://issues.apache.org/jira/browse/SPARK-1825
Best Regards,
Shixiong Zhu
2015-01-28 17:35 GMT+08:00 Marco marco@gmail.com:
I've created a spark app, which runs fine if
Then your Spark is not built for YARN. Try to build it with
sbt/sbt -Dhadoop.version=2.3.0 -Pyarn assembly
Hello,
I'm using *Spark 1.1.0* and *Solr 4.10.3*. I'm getting an exception when
using *HttpSolrServer* from within Spark Streaming:
15/01/28 13:42:52 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodError:
Hello Spark fellows :),
I think I need some help to understand how .cache and task input work within a
job.
I have a 7 GB input matrix in HDFS that I load using .textFile(). I also have
a config file which contains an array of 12 Logistic Regression Model
parameters, loaded as an
So, following up on your suggestion, I'm still having some problems getting the
configuration changes recognized when my job runs.
I've added jets3t.properties to the root of my application jar file that I
submit to Spark (via spark-submit).
I've verified that my jets3t.properties is at the
On Wed, Jan 28, 2015 at 1:44 PM, Matan Safriel dev.ma...@gmail.com wrote:
So I assume I can safely run a function F of mine within the spark driver
program, without dispatching it to the cluster (?), thereby sticking to one
piece of code for both a real cluster run over big data, and for small
I'm running into a new issue with Snappy causing a crash (using Spark
1.2.0). Did anyone see this before?
-Sven
2015-01-28 16:09:35,448 WARN [shuffle-server-1] storage.MemoryStore
(Logging.scala:logWarning(71)) - Failed to reserve initial memory
threshold of 1024.0 KB for computing block
Stephen,
Scala 2.11 worked fine for me. I did the dev change and then compiled. Not
using it in production, but I go back and forth between 2.10 and 2.11.
Cheers
k/
On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman
stephen.haber...@gmail.com wrote:
Hey,
I recently compiled Spark master against
Hmm, I can’t see why using ~ would be problematic, especially if you
confirm that echo ~/path/to/pem expands to the correct path to your
identity file.
If you have a simple reproduction of the problem, please send it over. I’d
love to look into this. When I pass paths with ~ to spark-ec2 on my
Has anybody successfully installed and run spark-1.2.0 on Windows 2008 R2 or
Windows 7? How do you get that to work?
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
Sent:
Yeah, I agree ~ should work. And it could have been [read: probably was]
the fact that one of the EC2 hosts was in my known_hosts (don't know, never
saw an error message, but the behavior is no error message for that state),
which I had fixed later with Pete's patch. But the second execution when
Hey,
I recently compiled Spark master against scala-2.11 (by running
the dev/change-versions script), but when I run spark-shell,
it looks like the sc variable is missing.
Is this a known/unknown issue? Are others successfully using
Spark with scala-2.11, and specifically spark-shell?
It is
Hello all,
I've been hitting a divide by zero error in Parquet though Spark detailed
(and fixed) here: https://github.com/apache/incubator-parquet-mr/pull/102
Is anyone else hitting this error? I hit it frequently.
It looks like the Parquet team is preparing to release 1.6.0 and, since they
It looks like it's just a problem with the log message? is it actually
causing a problem in Parquet / Spark? but yeah seems like an easy fix.
On Wed, Jan 28, 2015 at 9:28 PM, Jim Carroll jimfcarr...@gmail.com wrote:
Hello all,
I've been hitting a divide by zero error in Parquet though Spark
Answered my own questions seconds later: these aren't doubles, so you
don't get NaN, you get an Exception. Right.
On Wed, Jan 28, 2015 at 9:35 PM, Sean Owen so...@cloudera.com wrote:
It looks like it's just a problem with the log message? is it actually
causing a problem in Parquet / Spark? but
I have written a custom InputSplit and I want to pin it to the specific node
where my data is stored, but currently the split can start at any node and pick
data from a different node in the cluster. Any suggestion on how to set the host
in Spark?
Hi,
On Thu, Jan 29, 2015 at 1:54 AM, YaoPau jonrgr...@gmail.com wrote:
My thinking is to maintain state in an RDD and update it and persist it with
each 2-second pass, but this also seems like it could get messy. Any
thoughts or examples that might help me?
I have just implemented some
Hi,
Probably a naive question.. But I am creating a spark cluster on ec2
using the ec2 scripts in there..
But is there a master param I need to set..
./bin/pyspark --master [ ] ??
I don't yet fully understand the ec2 concepts so just wanted to confirm
this??
Thanks
--
Mohit
When you want
It was only hanging when I specified the path with ~; I never tried a relative path.
Hanging on the waiting for ssh to be ready on all hosts. I let it sit for
about 10 minutes then I found the StackOverflow answer that suggested
specifying an absolute path, cancelled, and re-run with --resume and the
I'm not an expert on streaming, but I think you can't do anything like this
right now. It seems like a very sensible use case, though, so I've created
a jira for it:
https://issues.apache.org/jira/browse/SPARK-5467
On Wed, Jan 28, 2015 at 8:54 AM, YaoPau jonrgr...@gmail.com wrote:
The
Hello,
probably this question was already asked but still I'd like to confirm from
Spark users.
The following blog shows 'Hive on Spark':
http://blog.cloudera.com/blog/2014/12/hands-on-hive-on-spark-in-the-aws-cloud/
How is it different from using Hive as the data storage for Spark SQL
Hey Yana,
An update about this Parquet filter push-down issue. It turned out to be
a bit complicated, but (hopefully) all clear now.
1. Yesterday I found a bug in Parquet, which essentially disables row
group filtering for almost all AND predicates.
* JIRA ticket: PARQUET-173
Hello all,
I'm trying to build the sample application on the spark 1.2.0 quickstart
page (https://spark.apache.org/docs/latest/quick-start.html) using the
following build.sbt file:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %%
Hello,
We are trying to insert a case class into Parquet using Spark SQL. When I'm
creating the SchemaRDD, which includes a Set, I get the following exception:
sqc.createSchemaRDD(r)
scala.MatchError: Set[(scala.Int, scala.Int)] (of class
scala.reflect.internal.Types$TypeRef$$anon$1)
at
It looks like you're shading in the Apache HTTP commons library and it's a
different version than what is expected. (Maybe 4.6.x based on the Javadoc.)
I see you are attempting to exclude commons-httpclient by using:
<exclusion>
  <groupId>commons-httpclient</groupId>
--
Abhi Basu
Hi Guys,
I have a similar question and doubt. How does Spark create an executor on the
same node where the data block is stored? Does it first take information from the
HDFS NameNode, get the block information, and then place the executor on the
same node if the spark-worker daemon is installed?
-
Hi,
Is it possible to append to an existing (hdfs) file, through some Spark
action?
Should there be any reason not to use a hadoop append api within a Spark
job?
Thanks,
Matan
You can call any API you like in a Spark job, as long as the libraries
are available, and Hadoop HDFS APIs will be available from the
cluster. You could write a foreachPartition() that appends partitions
of data to files, yes.
Spark itself does not use appending. I think the biggest reason is
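A rough sketch of the foreachPartition idea above (assumes append support is enabled on the HDFS cluster, the per-partition files already exist, and the output path is illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Sketch: appends each partition of an RDD[String] to its own pre-existing HDFS file.
def appendPartitions(rdd: RDD[String], dir: String): Unit = {
  rdd.mapPartitionsWithIndex { (partId, iter) =>
    val fs = FileSystem.get(new Configuration())
    val out = fs.append(new Path(s"$dir/part-$partId"))   // illustrative layout
    try iter.foreach(line => out.write((line + "\n").getBytes("UTF-8")))
    finally out.close()
    Iterator(partId)   // marker so the transformation returns something
  }.count()            // transformations are lazy; force the writes with an action
}
```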
send an email to user-unsubscr...@spark.apache.org
Cheers
On Wed, Jan 28, 2015 at 2:16 PM, Abhi Basu 9000r...@gmail.com wrote:
--
Abhi Basu
Hello all,
I'm trying to build the sample application on the spark 1.2.0 quickstart
page (https://spark.apache.org/docs/latest/quick-start.html) using the
following build.sbt file:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %%
https://issues.apache.org/jira/browse/SPARK-2356
Take a look through the comments, there are some workarounds listed there.
On Wed, Jan 28, 2015 at 1:40 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Has anybody successfully installed and run spark-1.2.0 on Windows 2008 R2 or
Hi Jim,
I am sorry, I know about your patch and I will commit it ASAP.
Lukas Nalezenec
On 28.1.2015 22:28, Jim Carroll wrote:
Hello all,
I've been hitting a divide by zero error in Parquet though Spark detailed
(and fixed) here: https://github.com/apache/incubator-parquet-mr/pull/102
Is
I think this repartitionAndSortWithinPartitions() method may be what I'm
looking for in [1]. At least it sounds like it is. Will this method allow
me to deal with sorted partitions even when the partition doesn't fit into
memory?
[1]
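For reference, a small sketch of how it's called (Spark 1.2, pair RDDs with an ordered key; whether a large partition spills to disk rather than blowing up memory depends on the shuffle implementation, so treat this as a sketch, not a guarantee):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair/ordered RDD implicits (pre-1.3 style)
import org.apache.spark.rdd.RDD

// Sketch: shuffle into 100 partitions, sorting by key during the shuffle,
// then consume each partition's iterator in key order.
def processSorted(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
  pairs
    .repartitionAndSortWithinPartitions(new HashPartitioner(100))
    .mapPartitions(iter => iter.map { case (k, v) => (k, v) })   // keys arrive sorted
```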
For example, consider a word count over long text data (on the order of 100 GB).
There is clearly a skew in the word frequencies; the distribution is expected to
have a long tail. The most frequent word probably accounts for more than 1/10 of the total.
word count code
```
val allWordLineSplited: RDD[String] = // create
Hi,
I am getting a stack overflow error when querying a schemardd comprised of
parquet files. This is (part of) the stack trace:
Caused by: java.lang.StackOverflowError
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at
Ohhh nice! Would be great if you can share some code with us soon. It is
indeed a very complicated problem and there is probably no single
solution that fits all usecases. So having one way of doing things
would be a great reference. Looking forward to that!
On Wed, Jan 28, 2015 at 4:52 PM, Tobias
Thanks for sending this over, Peter.
What if you try this? (i.e. Remove the = after --identity-file.)
ec2/spark-ec2 --key-pair=spark-streaming-kp --identity-file
~/.pzkeys/spark-streaming-kp.pem --region=us-east-1 login
pz-spark-cluster
If that works, then I think the problem in this case is
If that was indeed the problem, I suggest updating your answer on SO
http://stackoverflow.com/a/28005151/877069 to help others who may run
into this same problem.
On Wed Jan 28 2015 at 9:40:39 PM Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Thanks for sending this over, Peter.
What
Hi
How do I set a preferred location for an InputSplit in Spark standalone?
I have data on a specific machine and I want to read it using splits which are
created for that node only, by assigning some property which helps Spark
to create a split on that node only.
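There isn't a public hook for placing arbitrary custom InputSplits, but two things may help: for HDFS input Spark already derives preferred locations from the block locations the NameNode reports, and for small, explicitly distributed collections SparkContext.makeRDD accepts per-element location preferences. A sketch of the latter (hostnames are illustrative, and the preferences are hints, not guarantees):

```scala
// Sketch: pair each element with the worker hostnames it should preferably run on.
// Assumes an existing SparkContext `sc`.
val withPrefs: Seq[(String, Seq[String])] = Seq(
  ("record stored on node1", Seq("node1.example.com")),
  ("record stored on node2", Seq("node2.example.com"))
)
val rdd = sc.makeRDD(withPrefs)   // location preferences are hints, not guarantees
```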
You could use foreachRDD to do the operations and then inside the
foreach create an accumulator to gather all the errors together
dstream.foreachRDD { rdd =>
val accumulator = new Accumulator[]
rdd.map { . }.count // whatever operation that is error prone
// gather all errors
Thanks very much. It seems that I have to use HiveContext at present.
On Jan 28, 2015, at 11:34 AM, Kuldeep Bora kuldeep.b...@gmail.com wrote:
UDAF is a WIP, at least from an API user's perspective, as there is no public API
to my knowledge.
https://issues.apache.org/jira/browse/SPARK-3947
Thanks
On
Cheers
Date: Wed, 28 Jan 2015 14:18:49 -0800
Subject: Re: unsubscribe
From: yuzhih...@gmail.com
To: 9000r...@gmail.com
CC: user@spark.apache.org
send an email to user-unsubscr...@spark.apache.org
Cheers
On Wed, Jan 28, 2015 at 2:16 PM, Abhi Basu 9000r...@gmail.com wrote:
--
Abhi Basu
Below is trace from trying to access with ~/path. I also did the echo as
per Nick (see the last line), looks ok to me. This is my development box
with Spark 1.2.0 running CentOS 6.5, Python 2.6.6
[pete.zybrick@pz-lt2-ipc spark-1.2.0]$ ec2/spark-ec2
--key-pair=spark-streaming-kp
That's definitely a good supplement to the current Spark Streaming, I've heard
many guys want to process the data using log time. Looking forward to the code.
Thanks
Jerry
-Original Message-
From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Thursday, January 29, 2015 10:33
Spark SQL on Hive
1. The purpose of Spark SQL is to allow Spark users to selectively use SQL
expressions (with not a huge number of functions currently supported) when
writing Spark jobs
2. Already Available
Hive on Spark
1. Spark users will automatically get the whole set of Hive’s rich
Hi Fanilo,
How many cores are you using per executor? Are you aware that you can
combat the "container is running beyond physical memory limits" error by
bumping the spark.yarn.executor.memoryOverhead property?
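For reference, a sketch of how that property can be bumped (value in MB; it must be set before the application starts, e.g. via --conf on spark-submit or in the SparkConf used to build the context):

```scala
import org.apache.spark.SparkConf

// Sketch: reserve 1 GB of off-heap headroom per executor container on YARN.
// Equivalent to: spark-submit --conf spark.yarn.executor.memoryOverhead=1024 ...
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024")
```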
Also, are you caching the parsed version or the text?
-Sandy
On Wed, Jan 28, 2015 at
When I examine the dependencies again, I see that SolrJ library is using v.
4.3.1 of org.apache.httpcomponents:httpclient
[INFO] +- org.apache.solr:solr-solrj:jar:4.10.3:compile
[INFO] | +- org.apache.httpcomponents:httpclient:jar:4.3.1:compile
==
[INFO] | +-
Each machine has 24 cores, but I assume each executor on a machine is
allocated one core max because I set the --executor-cores property to 1.
I’m going to try a higher memoryOverhead later, I’ll post the results.
I’m caching the parsed version, something like
val matrix =
I have been trying to work around a similar problem with my Typesafe config
*.conf files seemingly not appearing on the executors. (Though now that I
think about it, it's not because the files are absent in the JAR, but because
the -Dconf.resource environment variable I pass to the master obviously
Yeah it sounds like your original exclusion of commons-httpclient from
hadoop-* was correct, but it's still coming in from somewhere.
Can you try something like this?:
<dependency>
  <artifactId>commons-http</artifactId>
  <groupId>httpclient</groupId>
  <scope>provided</scope>
</dependency>
ref:
Hi Antony, Did you get past this error by repartitioning your job with
smaller tasks as Sven Krasser pointed out?
From: Antony Mayi antonym...@yahoo.com
Reply-To: Antony Mayi antonym...@yahoo.com
Date: Tuesday, January 27, 2015 at 5:24 PM
To: Guru Medasani gdm...@outlook.com, Sven Krasser
I'm not quite sure if I understood it correctly, but can you not create a
key from the timestamps and do the reduceByKeyAndWindow over it?
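Roughly, a sketch of that (parseEventTimeBucket is a hypothetical helper that extracts the log timestamp and rounds it down to a bucket; the windowing itself still advances on arrival time):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream implicits (pre-1.3)
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: pull the event timestamp out of a weblog line and
// round it down to a 60-second bucket.
def parseEventTimeBucket(line: String): Long = ???

// Sketch: count hits per event-time bucket over a sliding window.
def countsByEventTime(lines: DStream[String]): DStream[(Long, Int)] =
  lines
    .map(line => (parseEventTimeBucket(line), 1))
    .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(2))
```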
Thanks
Best Regards
On Wed, Jan 28, 2015 at 10:24 PM, YaoPau jonrgr...@gmail.com wrote:
The TwitterPopularTags example works great: the Twitter firehose
HadoopRDD will try to split the file into partitions of about 64 MB each, so you
got 1916+ partitions (assuming 100 KB per row, they are about 80 GB in total).
I think there is a very small chance that one object or one batch will be
bigger than 2 GB.
Maybe there is a bug when it splits the pickled file; could you create a
Hey Jorge,
This is expected. Because there isn’t an obvious mapping from Set[T]
to any SQL types. Currently we have complex types like array, map, and
struct, which are inherited from Hive. In your case, I’d transform the
Set[T] into a Seq[T] first, then Spark SQL can map it to an
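Something along these lines (a sketch; the case classes are illustrative stand-ins for the original schema, and `sqc`/`r` are the SQLContext and RDD from the original post):

```scala
// Sketch: Set has no SQL mapping, so convert it to a Seq before building the SchemaRDD.
case class WithSet(id: Int, pairs: Set[(Int, Int)])      // original shape
case class WithSeq(id: Int, pairs: Seq[(Int, Int)])      // SQL-mappable shape

// Assumes `sqc` is the SQLContext and `r` is an RDD[WithSet].
val converted = r.map(x => WithSeq(x.id, x.pairs.toSeq))
val schemaRdd = sqc.createSchemaRDD(converted)
schemaRdd.saveAsParquetFile("/tmp/with-seq.parquet")      // illustrative path
```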
The TwitterPopularTags example works great: the Twitter firehose keeps
messages pretty well in order by timestamp, and so to get the most popular
hashtags over the last 60 seconds, reduceByKeyAndWindow works well.
My stream pulls Apache weblogs from Kafka, and so it's not as simple:
messages can
Thanks Sean, that works, and I started the join of this mappedRDD to another one
I have. I have to internalize the use of map versus flatMap. Thinking in MapReduce
Java Hadoop code often blinds me :-)
From: Sean Owen so...@cloudera.com
To: Sanjay Subramanian sanjaysubraman...@yahoo.com
Cc: