Hi,
We are running a time-sensitive application with 70 partitions of roughly 800 MB
each. The application first loads data from a database in a different cluster,
then applies a filter, caches the filtered data, then applies a map and a
reduce, and finally collects the results.
The application will be
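For reference, a minimal Scala sketch of the pipeline as described (the data source, the filter, and the map/reduce functions are placeholders, not the actual application code):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("time-sensitive-job"))

// Placeholder source: the real job loads from a database in another cluster.
val raw = sc.textFile("hdfs:///path/to/export", 70)   // 70 partitions, ~800 MB each

val filtered = raw.filter(_.nonEmpty).cache()          // cache only the filtered data
val mapped = filtered.map(_.length.toLong)             // placeholder map
println(mapped.reduce(_ + _))                          // reduce; the result comes back to the driver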
LICENSE and NOTICE are fine. Signature and checksum are fine. I
unzipped and built the plain source distribution, and it built successfully.
However, I am seeing a consistent test failure with mvn -DskipTests
clean package; mvn test, in the Hive module:
- SET commands semantics for a HiveContext *** FAILED ***
Hi all
I have noticed that the “join” operator in PySpark is implemented with union and a
groupByKey operator instead of a cogroup operator. This change
will probably generate more shuffle stages, for example:
rdd1 = sc.makeRDD(...).partitionBy(2)
rdd2 = sc.makeRDD(...).partitionBy(2)
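For illustration, a Scala sketch of the scenario the example above is building toward (toy data, assuming an existing SparkContext `sc`): when both pair RDDs already share a partitioner, a cogroup-based join can run without an extra shuffle, which a union + groupByKey rewrite does not guarantee.

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(2))
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(new HashPartitioner(2))

// A cogroup-based join can reuse the shared HashPartitioner and combine each
// partition locally; building join from union + groupByKey reshuffles instead.
val joined = left.join(right)
println(joined.collect().toSeq)   // roughly Seq((1,(a,x)), (2,(b,y)))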
Thanks Cheng. Just one more question - does that mean that we still need
enough memory in the cluster to uncompress the data before it can be
compressed again, or does it just read the raw data as is?
On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com wrote:
Currently there’s
Hello there,
So I just took a quick look at the pom and I see two problems with it.
- activeByDefault does not work the way you think it does. It only
activates by default if you do not explicitly activate other
profiles. So if you do mvn package, scala-2.10 will be activated;
but if you do mvn
Hi Anant,
Please see the changes.
https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
I have changed the input format to Vector of String. I think we can also make
it generic.
Lines 59, 72: that counter will not affect parallelism, since it
Hey Patrick,
On Thu, Nov 13, 2014 at 10:49 AM, Patrick Wendell pwend...@gmail.com wrote:
I'm not sure chaining activation works like that. At least in my
experience, activation based on properties only works for properties
explicitly specified at the command line rather than declared
elsewhere.
On Thu, Nov 13, 2014 at 10:58 AM, Patrick Wendell pwend...@gmail.com wrote:
That's true, but note the code I posted activates a profile based on
the lack of a property being set, which is why it works. Granted, I
did not test that if you activate the other profile, the one with the
property
Hi,
Shivaram and I stumbled across this problem a few weeks ago, and AFAIK
there is no nice solution. We worked around it by avoiding jobs whose tasks
have two locality levels.
To fix this properly, we really need to address the underlying problem in the
scheduling code, which
Hey Sean,
Thanks for pointing this out. Looks like a bad test where we should be
doing a Set comparison instead of an Array comparison.
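A tiny sketch of the difference (made-up values, not the actual test data):

val expected = Array("a=1", "b=2")
val actual   = Array("b=2", "a=1")       // same rows, different order

println(expected == actual)              // false: Arrays compare by reference
println(expected.sameElements(actual))   // false: element-wise, so order-sensitive
println(expected.toSet == actual.toSet)  // true: order-insensitive content comparison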
Michael
On Thu, Nov 13, 2014 at 2:05 AM, Sean Owen so...@cloudera.com wrote:
LICENSE and NOTICE are fine. Signature and checksum are fine. I
unzipped and built the plain
Do people usually import o.a.spark.rdd._ ?
Also, in order to maintain source and binary compatibility, we would need to
keep both, right?
On Thu, Nov 6, 2014 at 3:12 AM, Shixiong Zhu zsxw...@gmail.com wrote:
I saw many people ask how to convert an RDD to a PairRDDFunctions. I would
like to
Yeah, this seems to be somewhat environment specific too. The same test has
been passing here for a while:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.1-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/lastBuild/consoleFull
2014-11-13 11:26 GMT-08:00 Michael Armbrust
We should implement this using cogroup(); it will just require some tracking to
map Python partitioners into dummy Java ones so that Java Spark’s cogroup()
operator respects Python’s partitioning. I’m sure that there are some other
subtleties, particularly if we mix datasets that use different
Ah right. This is because I'm running Java 8. This was fixed in
SPARK-3329
(https://github.com/apache/spark/commit/2b7ab814f9bde65ebc57ebd04386e56c97f06f4a#diff-7bfd8d7c8cbb02aa0023e4c3497ee832).
Consider back-porting it if other reasons arise, but this is specific
to tests and to Java 8.
On
Hi Mridul,
In the case Shivaram and I saw, and based on my understanding of Ma chong's
description, I don't think that completely fixes the problem.
To be very concrete, suppose your job has two tasks, t1 and t2, and they
each have input data (in HDFS) on h1 and h2, respectively, and that h1 and
Hi,
I am noticing that the first step of Spark jobs in the 1.2 branch does a
TimSort, and there is some time spent doing the TimSort. Is this assigning
the RDD blocks to different nodes based on a sort order?
Could someone please point to a JIRA about this change so that I can read
more about it ?
This sounds like it may be exactly the problem we've been having (and about
which I recently posted on the user list).
Is there any way of monitoring its attempts to wait, giving up, and trying
another level?
In general, I'm trying to figure out why we can have repeated identical
jobs, the
If we put the `implicit` into package object rdd or object RDD, then when we
write `rdd.groupByKey()`, because rdd is an instance of RDD, the Scala compiler
will search `object RDD` (the companion object) and `package object rdd` (the
package object) by default. We don't need to import them explicitly. Here is a post
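A small self-contained sketch with toy classes (not Spark's real RDD) showing why an implicit placed in the companion object is found without any import:

class MyRDD[T](val data: Seq[T])

object MyRDD {
  // Because this lives in MyRDD's companion object, it is in implicit scope
  // for every expression involving a MyRDD -- no import needed at call sites.
  implicit class PairFunctions[K, V](rdd: MyRDD[(K, V)]) {
    def groupByKey(): Map[K, Seq[V]] =
      rdd.data.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
  }
}

val rdd = new MyRDD(Seq(("a", 1), ("a", 2), ("b", 3)))
println(rdd.groupByKey())   // Map(a -> List(1, 2), b -> List(3))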
No, the columnar buffer is built in small batches; the batch
size is controlled by the `spark.sql.inMemoryColumnarStorage.batchSize`
property. The default value for this in master and branch-1.2 is 10,000
rows per batch.
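For example, a minimal Scala sketch (assuming an existing SparkContext `sc` and a registered table; the table name below is a placeholder):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Each in-memory columnar batch holds up to this many rows (10,000 is the default);
// lower it if decompressing a batch at a time still needs too much memory.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

// Caching builds the columnar buffers batch by batch, not all at once.
sqlContext.cacheTable("my_table")   // "my_table" is a placeholder table name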
On 11/14/14 1:27 AM, Sadhan Sood wrote:
Thanks Cheng. Just
In the specific example stated, the user had two tasksets if I
understood right ... the first taskset reads off the db (dfs in your
example), does some filtering, etc., and caches the result.
The second works off the cached data (which is now at the process-local
locality level) to do map, group, etc.
The
Hi Ashutosh,
Please edit the README file. I think the following function call has
changed now:
model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double)
Regards,
Meethu Mathew
Engineer
Flytxt
http://www.linkedin.com/home?trk=hb_tab_home_top
On
That seems like a great idea. Can you submit a pull request?
On Thu, Nov 13, 2014 at 7:13 PM, Shixiong Zhu zsxw...@gmail.com wrote:
If we put the `implicit` into package object rdd or object RDD, then when
we write `rdd.groupByKey()`, because rdd is an instance of RDD, the Scala
compiler will search
OK. I'll take it.
Best Regards,
Shixiong Zhu
2014-11-14 12:34 GMT+08:00 Reynold Xin r...@databricks.com:
That seems like a great idea. Can you submit a pull request?
On Thu, Nov 13, 2014 at 7:13 PM, Shixiong Zhu zsxw...@gmail.com wrote:
If we put the `implicit` into package object rdd or
Hi,
I have a doubt regarding the input to your algorithm.
val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]],
percent: Double, sc: SparkContext)
Here our input data is an RDD[Vector[String]]. How can we create this
RDD
Please use the following snippet. I am still working on making it a generic
vector, so that the input need not always be Vector[String]. But String will
work fine for now.
def main(args: Array[String]) {
  val sc = new SparkContext("local", "OutlierDetection")
  val dir = "hdfs://localhost:54310/train3"
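The snippet is cut off here; a plausible continuation (an assumption, not the project's actual code) that builds the RDD[Vector[String]] asked about above:

  // Hypothetical continuation: split each input line on commas to get Vector[String].
  val data = sc.textFile(dir).map(line => line.split(",").toVector)   // RDD[Vector[String]]
  val model = OutlierWithAVFModel.outliers(data, 5.0, sc)             // 5.0 is a placeholder percentage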