Looks like currently the solution to update Spark Streaming jars/configurations is
to
1) save the current Kafka offsets somewhere (say ZooKeeper)
2) shut down the cluster and restart
3) connect to Kafka with the previously saved offsets
Assuming we're reading from Kafka, which provides nice persistence and
Hi Jianshi,
For simulation purposes, I think you can try ConstantInputDStream and
QueueInputDStream to convert one RDD or a series of RDDs into a DStream; the first
one outputs the same RDD in each batch duration, and the second one outputs one
RDD from a queue in each batch duration. You can take a
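For example, a minimal sketch of both approaches (assuming an existing
StreamingContext ssc and some illustrative RDDs rdd, rdd1, rdd2 not from this thread):

import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.ConstantInputDStream

// Emits the same RDD in every batch interval.
val constantStream = new ConstantInputDStream(ssc, rdd)

// Emits one RDD from the queue per batch interval (oneAtATime = true by default).
val rddQueue = Queue[RDD[Int]](rdd1, rdd2)
val queueStream = ssc.queueStream(rddQueue)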
I believe that's correct. See:
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features
On Monday, October 27, 2014, agg212alexander_galaka...@brown.edu wrote:
Hey, I'm trying to run TPC-H Query 4 (shown below), and get the following
error:
Exception in thread main
Hi Saisai,
I understand it's non-trivial, but the requirement of simulating offline
data as a stream is also fair. :)
I just wrote a prototype; however, I need to do a collect and a bunch of
parallelize calls...
// RDD of (timestamp, value)
def rddToDStream[T: ClassTag](data: RDD[(Long, T)],
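A hypothetical sketch of such a collect-and-parallelize prototype (not the original
code; the windowMs parameter and helper names are assumptions):

import scala.collection.mutable.Queue
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

def rddToDStream[T: ClassTag](data: RDD[(Long, T)], windowMs: Long,
                              ssc: StreamingContext): DStream[T] = {
  // group by time window, then collect the buckets back to the driver
  val buckets = data.groupBy(_._1 / windowMs).collect().sortBy(_._1)
  // re-parallelize each bucket into its own RDD and feed them to queueStream
  val queue = Queue(buckets.map { case (_, values) =>
    ssc.sparkContext.parallelize(values.map(_._2).toSeq)
  }: _*)
  ssc.queueStream(queue)
}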
I think your solution may not be scalable if the data size increases,
since you have to collect all your data back to the driver node, so the memory
usage of the driver will be a problem.
Why not filter out specific time-range data as an RDD? After filtering the whole
time range, you will get a
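A minimal sketch of that filter-per-window idea (assuming data: RDD[(Long, String)]
keyed by timestamp in milliseconds and an assumed window size):

import org.apache.spark.rdd.RDD

val windowMs = 60000L
// distinct window ids, collected and sorted on the driver (small compared to the data)
val windows: Seq[Long] = data.map(_._1 / windowMs).distinct().collect().sorted
// one lazily filtered RDD per time window, in timestamp order
val perWindow: Seq[(Long, RDD[(Long, String)])] = windows.map { w =>
  w -> data.filter { case (ts, _) => ts / windowMs == w }
}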
It works fine on my *Spark 1.1.0*
Thanks
Best Regards
On Mon, Oct 27, 2014 at 12:22 AM, octavian.ganea octavian.ga...@inf.ethz.ch
wrote:
Hi Akhil,
Please see this related message.
http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-td17263.html
I am curious if this
There is no tool to tweak a spark cluster, but while writing the job, you
can consider the Tuning guidelines
http://spark.apache.org/docs/latest/tuning.html.
Thanks
Best Regards
On Mon, Oct 27, 2014 at 3:14 AM, Morbious knowledgefromgro...@gmail.com
wrote:
I wonder if there is any tool to
You're absolutely right, it's not 'scalable' as I'm using collect().
However, it's important to have the RDDs ordered by the timestamp of the
time window (groupBy puts data into the corresponding time window).
It's fairly easy to do in Pig, but somehow I have no idea how to express it
with RDDs...
You will face problems if the Spark version isn't compatible with your
Hadoop version. (Let's say you have Hadoop 2.x and you downloaded Spark
pre-compiled with Hadoop 1.x; then it would be a problem.) Of course you can
use Spark without specifying any Hadoop configuration unless you are
trying
Ok, back to Scala code, I'm wondering why I cannot do this:
data.groupBy(timestamp / window)
.sortByKey() // no sort method available here
.map(sc.parallelize(_._2.sortBy(_._1))) // nested RDD, hmm...
.collect() // returns Seq[RDD[(Timestamp, T)]]
Jianshi
On Mon, Oct 27, 2014 at 3:24
I think what you want is to make each bucket a new RDD, as in the Pig syntax
you mentioned:
gs = ORDER g BY group ASC, g.timestamp ASC // 'group' is the rounded timestamp
for each bucket
From my understanding, currently there's no such API in Spark to
achieve this; maybe you have
Yeah, you're absolutely right, Saisai.
My point is we should allow this kind of logic on RDDs, say, transforming an
RDD[(Key, Iterable[T])] into a Seq[(Key, RDD[T])].
Make sense?
Jianshi
On Mon, Oct 27, 2014 at 3:56 PM, Shao, Saisai saisai.s...@intel.com wrote:
I think what you want is to
There's an annoying small usability issue in HiveContext.
By default, it creates a local metastore, which prevents other processes
using HiveContext from being launched from the same directory.
How can I make the metastore local to each HiveContext? Is there an
in-memory metastore configuration?
Yes, I understand what you want, but it may be hard to achieve without collecting
back to the driver node.
Besides, can we think of another way to do it?
Thanks
Jerry
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Monday, October 27, 2014 4:07 PM
To: Shao, Saisai
Cc:
Would you mind sharing the DDLs of all involved tables? What format are
these tables stored in? Is this issue specific to this query? I guess
Hive, Shark and Spark SQL all read from the same HDFS dataset?
On 10/27/14 3:45 PM, lyf刘钰帆 wrote:
Hi,
I am using SparkSQL 1.1.0 with cdh 4.6.0
Sure, let's still focus on the streaming simulation use case. It's a very
useful problem to solve.
If we're going to use the same Spark Streaming core for the simulation, the
simplest way is to have globally sorted RDDs and use ssc.queueStream.
Thus collecting the key part to the driver is
Hi Stephen,
I tried it again.
To avoid the profile impact, I executed mvn -DskipTests clean package with
Hadoop 1.0.4 by default, then opened IDEA and imported it as a Maven project,
and I didn't choose any profile in the import wizard.
Then I ran Make Project or Rebuild Project in IDEA; unfortunately
Any suggestion? :)
Jianshi
On Thu, Oct 23, 2014 at 3:49 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s
per topic).
Which configuration do you recommend?
- 1 Spark app consuming all Kafka topics
- 10 separate Spark
Hi,
Try running the following in the Spark folder:
bin/run-example SparkPi 10
If this runs fine, just see the set of arguments being passed via this
script, and try in a similar way.
Thanks,
On Thu, Oct 16, 2014 at 2:59 PM, Christophe Préaud
christophe.pre...@kelkoo.com wrote:
Hi,
I have
Hi, all
We would like to use sort-based shuffle in our Spark application but
have encountered unhandled problems.
It seems that the data file and index file are not in a consistent state, and we
got wrong result sets when trying to use
Spark to bulk load data into HBase.
There are many
Hi,
Probably the problem you met is related to this JIRA ticket
(https://issues.apache.org/jira/browse/SPARK-3948). It's potentially a kernel
2.6.32 bug which can make sort-based shuffle fail. I'm not sure your problem
is the same as this one; would you mind checking your kernel version?
Apache Spark Team,
We recently started experimenting with Apache Spark for high-speed data
retrieval. We downloaded the Apache Spark source code, installed Git, packaged
the assembly, installed Scala, and ran some examples mentioned in the
documentation. We did all these steps on Windows XP. Till this
Hm, these DDLs look quite normal. Forgot to ask: what version of Spark
SQL are you using? Can you reproduce this issue with the master branch?
Also, a small sample of input data that can reproduce this issue would be
very helpful. For example, you can run SELECT * FROM tbl TABLESAMPLE (10
ROWS)
So the problem was that Spark had internally set HOME to /home. A hack to make
this work with Python is to add the following line before the call to textblob:
os.environ['HOME'] = '/home/hadoop'
__
Maybe I'll add one more question. I think that the
Hi Spark Users,
I am trying to use the Akka Camel library together with Spark Streaming and
it is not working when I deploy my Spark application to the Spark Cluster.
It does work when I run the application locally so this seems to be an
issue with how Spark loads the reference.conf file from the
On 10/27/2014 05:19 AM, Jianshi Huang wrote:
Any suggestion? :)
Jianshi
On Thu, Oct 23, 2014 at 3:49 PM, Jianshi Huang
jianshi.hu...@gmail.com wrote:
The Kafka stream has 10 topics and the data rate is quite high (~
100K/s per topic).
Which
Apologies, was not finished typing.
14/10/23 09:31:30 ERROR OneForOneStrategy: No configuration setting found
for key 'akka.camel'
akka.actor.ActorInitializationException: exception during creation
This error means that the Akka Camel reference.conf is not being loaded,
even though it is in the
I'm quite interested in this as well. I remember something about a streaming
context needing one core. If that's the case, then won't 10 apps require 10
cores? Seems like a waste unless each topic is quite resource hungry? Would
love to hear from the experts :)
Date: Mon, 27 Oct 2014 06:35:29
Hi,
I think the problem is caused by data serialization. You can follow the link
https://spark.apache.org/docs/latest/tuning.html to register your class
testing.
For PySpark 1.1.0, there is a problem with the default serializer.
Hello all,
I am trying to run a Spark job, but while running the job I am getting missing
class errors. First I got a missing class error for Joda DateTime; I added the
dependency in my build.sbt, rebuilt the assembly jar, and tried again. This
time I am getting the same error for
If you want to calculate the mean, variance, minimum, maximum and total count
for each column, especially for machine learning features, you can try
MultivariateOnlineSummarizer.
MultivariateOnlineSummarizer implements a numerically stable algorithm to
compute the sample mean and variance by column in
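A minimal sketch of using it (assuming an RDD[Vector] of feature rows named features):

import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

val summary = features.aggregate(new MultivariateOnlineSummarizer)(
  (summarizer, vector) => summarizer.add(vector),  // fold each row vector in
  (s1, s2) => s1.merge(s2))                        // merge per-partition summarizers

println(summary.mean)      // column-wise mean
println(summary.variance)  // column-wise sample variance
println(summary.count)     // number of rows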
This can happen due to a jar conflict, meaning one of the jars (possibly the
first one, apache-cassandra-thrift*.jar) has a class named
ITransportFactory but doesn't contain the method openTransport() in
it. You can make sure the jar has the ITransportFactory class in it and
it also
We encountered some special OOM cases of cogroup when the data in one
partition is not balanced.
1. The estimated size of used memory is inaccurate. For example, there are
too many values for some special keys. Because SizeEstimator.visitArray
only samples at most 100 cells for an array, it may
Hi:
After updating Spark to version 1.1.0, I experienced a snappy error which
was
posted here:
http://apache-spark-user-list.1001560.n3.nabble.com/Update-gcc-version-Still-snappy-error-tt15137.html
I avoided this problem with
Hi Sasi,
Thrift is not needed to integrate Cassandra with Spark. In fact the only dependency
you need is spark-cassandra-connector_2.10-1.1.0-alpha3.jar, and you can
upgrade to alpha4; we’re publishing a beta very soon. For future reference,
questions/tickets can be created
I have never tried this yet, but maybe you can use an in-memory Derby
database as metastore
https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html
I'll investigate this when I'm free; I guess we can use this for Spark SQL
Hive support testing.
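A minimal, untested sketch of what that might look like (the Derby connection URL
is taken from the Derby docs above and is an assumption, not a documented Spark
setting):

import org.apache.spark.sql.hive.HiveContext

// point the embedded metastore at an in-memory Derby database instead of
// the default metastore_db directory in the working directory
val hiveContext = new HiveContext(sc)
hiveContext.setConf("javax.jdo.option.ConnectionURL",
  "jdbc:derby:memory:metastore_db;create=true")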
On 10/27/14 4:38 PM, Jianshi Huang
I agree. I'd like to avoid SQL.
If I could store everything in Cassandra or Mongo and process in Spark,
that would be far preferable to creating a temporary working set.
I'd like to write a performance test. Let's say I have two large
collections A and B. Each collection has 2 columns and many
Please see
https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-EmbeddedMetastore
Cheers
On Oct 27, 2014, at 6:20 AM, Cheng Lian lian.cs@gmail.com wrote:
I have never tried this yet, but maybe you can use an in-memory Derby
database as
unsubscribe
Thanks Ted, this is exactly what Spark SQL LocalHiveContext does. To
make an embedded metastore local to a single HiveContext, we must
allocate different Derby database directories for each HiveContext, and
Jianshi is also trying to avoid that.
On 10/27/14 9:44 PM, Ted Yu wrote:
Please see
Take a look at the first section of:
http://spark.apache.org/community
Cheers
On Mon, Oct 27, 2014 at 6:50 AM, Ian Ferreira ianferre...@hotmail.com
wrote:
unsubscribe
Send an email to user-unsubscr...@spark.apache.org to unsubscribe.
Thanks
Best Regards
On Mon, Oct 27, 2014 at 7:41 PM, Ted Yu yuzhih...@gmail.com wrote:
Take a look at the first section of:
http://spark.apache.org/community
Cheers
On Mon, Oct 27, 2014 at 6:50 AM, Ian Ferreira
Hi,
I'm trying to run the SparkPi example using `--deploy-mode cluster`. Not
having much luck. Looks like the JAVA_HOME on my Mac is used for launching
the app on the Spark cluster running on CentOS. How can I set the JAVA_HOME
or better yet, how can I configure Spark to use the JAVA_HOME
You can run the which java command to see where exactly your Java is
installed, and export that as JAVA_HOME.
Thanks
Best Regards
On Mon, Oct 27, 2014 at 7:56 PM, Thomas Risberg trisb...@pivotal.io wrote:
Hi,
I'm trying to run the SparkPi example using
Hi Professor Lin,
It would be great if you could review the TRON code in Breeze and see
whether it is similar to the original TRON implementation... Breeze is
already integrated in MLlib (we are using BFGS, and OWLQN is in the works for
MLlib LogisticRegression), and comparing with TRON should be
I am working on running the following Hive query from Spark:
SELECT * FROM spark_poc.table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY
SORT BY IP_ADDRESS, COOKIE_ID
Ran into java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
(complete
Hi, Spark Users:
I tried to test Spark on a standalone box, but faced an issue whose root cause I
don't know. I basically followed exactly the documentation on deploying
Spark in a standalone environment.
1) I checked out the Spark source code of release 1.1.0
2) I built Spark with the following
Hi,
I have a standalone Spark cluster, where the worker and master reside on the
same machine. I submit a job to the cluster; the job executes for a
while and suddenly I get this exception with no additional trace:
ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@2490dce9
Hello,
I have been prototyping a text classification model that my company would
like to eventually put into production. Our technology stack is currently
Java based but we would like to be able to build our models in Spark/MLlib
and then export something like a PMML file which can be used for
I did a little more research on this. It looks like the worker started
successfully, but on port 40294. This is shown in both the log and the master web UI.
The question is: in the log, why is the master akka.tcp trying to connect to
a different port (44017)?
Yong
From:
Thanks Ted and Cheng for the in-memory Derby solution. I'll check it out. :)
And to me, using in-memory by default makes sense; if a user wants a shared
metastore, it needs to be specified. An 'embedded' local metastore in the
working directory barely has a use case.
Jianshi
On Mon, Oct 27, 2014
Setting the following while creating the SparkContext will sort it out:
.set("spark.core.connection.ack.wait.timeout", "600")
.set("spark.akka.frameSize", "50")
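In full, a minimal sketch of applying those settings when creating the SparkContext
(the app name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.core.connection.ack.wait.timeout", "600")
  .set("spark.akka.frameSize", "50")
val sc = new SparkContext(conf)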
On 27 Oct 2014 21:15, shahab shahab.mok...@gmail.com wrote:
Hi,
I have a stand alone Spark Cluster, where worker and master
On Monday, October 27, 2014, Shixiong Zhu zsxw...@gmail.com wrote:
We encountered some special OOM cases of cogroup when the data in one
partition is not balanced.
1. The estimated size of used memory is inaccurate. For example, there are
too many values for some special keys. Because
I am also using Spark 1.1.0 and I ran it on a cluster of nodes (it works if I
run it in local mode!).
If I put the accumulator inside the for loop, everything works fine. I
guess the bug is that an accumulator can be applied to JUST one RDD.
Still another undocumented 'feature' of Spark
Can you post the error message you get when trying to save the sequence
file? If you call first() on the RDD does it result in the same error?
On Mon, Oct 27, 2014 at 6:13 AM, buring qyqb...@gmail.com wrote:
Hi:
After update spark to version1.1.0, I experienced a snappy error
which
So, apparently `wholeTextFiles` runs the job again, passing null as argument
list, which in turn blows up my argument parsing mechanics. I never thought I
had to check for null again in a pure Scala environment ;)
On 26.10.2014, at 11:57, Marius Soutier mps@gmail.com wrote:
I tried that
We are working on the pipeline features, which would make this
procedure much easier in MLlib. This is still a WIP and the main JIRA
is at:
https://issues.apache.org/jira/browse/SPARK-1856
Best,
Xiangrui
On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani
chirag.lakh...@gmail.com wrote:
Hello,
I
Hi,
I want to test the performance of MapReduce and Spark on a program, find
the bottleneck, calculate the performance of each part of the program, and
so on. I was wondering if there is a tool for this kind of measurement, like
Ganglia, to help me in this regard.
Thanks!
After trying a few permutations, I was able to get rid of this error
by changing the datastax-cassandra-connector version to 1.1.0.alpha. Could it
be related to a bug in alpha4? But why would it manifest itself as a class
loading error?
On 27 Oct 2014, at 12:10, Saket Kumar
A NoSuchMethodError almost always means you are mixing different versions
of the same library on the classpath. In this case it looks like you have
more than one version of Guava. Have you added anything to the classpath?
On Mon, Oct 27, 2014 at 8:36 AM, nitinkak001 nitinkak...@gmail.com
I'd suggest checking out the Spark SQL programming guide to answer this
type of query:
http://spark.apache.org/docs/latest/sql-programming-guide.html
You could also perform it using the raw Spark RDD API
http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.rdd.RDD,
but its
Yes, I added all the Hive jars present in the Cloudera distribution of Hadoop.
I added them because I was getting ClassNotFoundException for many required
classes (one example stack trace below), so someone in the community
suggested including the Hive jars:
Exception in thread main
Which version of CDH are you using? I believe that hive is not correctly
packaged in 5.1, but should work in 5.2. Another option that people use is
to deploy the plain Apache version of Spark on CDH Yarn.
On Mon, Oct 27, 2014 at 11:10 AM, Nitin kak nitinkak...@gmail.com wrote:
Yes, I added
I am now on CDH 5.2 which has the Hive module packaged in it.
On Mon, Oct 27, 2014 at 2:17 PM, Michael Armbrust mich...@databricks.com
wrote:
Which version of CDH are you using? I believe that hive is not correctly
packaged in 5.1, but should work in 5.2. Another option that people use is
Hi,
I wanted to understand what kind of memory overheads are expected, if any,
while using the Java API. My application seems to have a lot of live Tuple2
instances and I am hitting a lot of GC, so I am wondering if I am doing
something fundamentally wrong. Here is what the top of my heap looks
Hello all - I am attempting to run MLLib's ALS algorithm on a substantial
test vector - approx. 200 million records.
I have resolved a few issues I've had with regards to garbage collection,
KryoSeralization, and memory usage.
I have not been able to get around this issue I see below however:
Great. Thank you very much Michael :-D
On Mon, Oct 27, 2014 at 2:03 PM, Michael Armbrust mich...@databricks.com
wrote:
I'd suggest checking out the Spark SQL programming guide to answer this
type of query:
http://spark.apache.org/docs/latest/sql-programming-guide.html
You could also
This is a known issue with PySpark: classes and objects of a custom
class defined in the current script cannot be serialized by pickle between the
driver and the workers.
You can work around this by putting 'testing' in a module and sending this
module to the cluster with `sc.addPyFile`.
Davies
On Sun, Oct 26, 2014 at 11:57 PM,
If I already have Hive running on Hadoop, do I need to build Spark with Hive
support using the
sbt/sbt -Phive assembly/assembly
command?
If the answer is no, how do I tell Spark where the Hive home is?
Thanks,
Hi all,
I’m submitting a simple task using the spark shell against a cassandraRDD (
Datastax Environment ).
I’m getting the following eception from one of the workers:
INFO 2014-10-27 14:08:03 akka.event.slf4j.Slf4jLogger: Slf4jLogger started
INFO 2014-10-27 14:08:03 Remoting: Starting
Hi,
I've come across this multiple times, but not in a consistent manner. I found
it hard to reproduce. I have a jira for it: SPARK-3080
Do you observe this error every single time? Where do you load your data from?
Which version of Spark are you running?
Figuring out the similarities may
Hi Burak.
I always see this error. I'm running the CDH 5.2 version of Spark 1.1.0. I
load my data from HDFS. By the time it hits the recommender it had gone
through many spark operations.
On Oct 27, 2014 4:03 PM, Burak Yavuz bya...@stanford.edu wrote:
Hi,
I've come across this multiple times,
Hi,
Is the following guaranteed to always provide an exact count?
foreachRDD(foreachFunc = rdd => {
rdd.count()
The documentation mentions: However, output operations (like foreachRDD)
have *at-least once* semantics, that is, the transformed data may get
written to an external entity more
Hi Josh,
The count() call will result in the correct number for each RDD; however,
foreachRDD doesn't return the result of its computation anywhere (it's
intended for things which cause side effects, like updating an accumulator
or making a web request). You might want to look at transform or the
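For example, a minimal sketch of counting via a side effect (assuming a
StreamingContext ssc and a DStream named stream; the names are illustrative):

// accumulator visible on the driver; foreachRDD's function runs on the driver
val recordCount = ssc.sparkContext.accumulator(0L)
stream.foreachRDD { rdd =>
  recordCount += rdd.count()  // the count itself is exact per RDD; we accumulate it here
}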
Spark is written in Scala. As a strongly typed functional programming
language, Scala makes Spark possible. The API of Spark mirrors the Scala
collections API, driving adoption through beautifully intuitive data
transformations which work both locally and at scale. And where's
more inspiration
We have a table containing 25 features per item id along with feature weights.
A correlation matrix can be constructed for every feature pair based on
co-occurrence. If a user inputs a feature they can find out the features that
are correlated with a self-join requiring a single full table
I am a Scala / Spark newbie (attending Paco Nathan's class).
What I need is some advice on how to set up IntelliJ (or Eclipse) to be
able to attach the debugger to the executing process. I know that this
is not feasible if the code is executing within the cluster. However, if
spark is
Hello,
I am able to submit a job to the Spark cluster from a Windows desktop, but the
executors are not able to run.
When I check the Spark UI (which is on Windows, as the driver is there), it shows
me JAVA_HOME, CLASS_PATH and other environment variables related to Windows.
I tried by setting
It somehow worked by placing all the jars (except guava) in the Hive lib after
--jars. I had initially tried to place the jars under another temporary
folder and point the executor and driver extraClassPath to that
directory, but it didn't work.
On Mon, Oct 27, 2014 at 2:21 PM, Nitin kak
You can access cached data in spark through the JDBC server:
http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub ronalday...@live.com wrote:
We have a table containing 25 features per item id along with
Hello,
I got the following query:
SELECT id,
       Count(*) AS amount
FROM table1
GROUP BY id
HAVING amount = (SELECT Max(mamount)
                 FROM (SELECT id,
                              Count(*) AS mamount
                       FROM table1
Hi,
I was trying to set up Spark SQL on a private cluster. I configured a
hive-site.xml under spark/conf that uses a local metastore, with the warehouse and
default FS name set to HDFS on one of my corporate clusters. Then I started the
Spark master, worker and thrift server. However, when creating a
hey guys
Anyone using CDH Spark Standalone? I installed Spark standalone through Cloudera
Manager
$ spark-shell --total-executor-cores 8
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/bin/../lib/spark/bin/spark-shell:
line 44:
Could you save the data before ALS and try to reproduce the problem?
You might try reducing the number of partitions and not using Kryo
serialization, just to narrow down the issue. -Xiangrui
On Mon, Oct 27, 2014 at 1:29 PM, Ilya Ganelin ilgan...@gmail.com wrote:
Hi Burak.
I always see this
Hi Xiangrui - I can certainly save the data before ALS - that would be a
great first step. Why would reducing the number of partitions help? I
would very much like to understand what's happening internally. Also, with
regards to Burak's earlier comment, here is the JIRA referencing this
problem.
It works fine for me on 5.2.0. But I think you are hitting:
https://issues.cloudera.org/browse/DISTRO-649
I am not sure what circumstances cause it to not be included, since it
works on my cluster, but this is apparently fixed in 5.2.1 where it is
a problem. I assume that follows quite shortly
So it doesn't matter which dialect I'm using? Because I set spark.sql.dialect to
sql.
Would the RDD resulting from the query below be partitioned on GEO_REGION,
GEO_COUNTRY? I ran some tests (using mapPartitions on the resulting RDD) and it
seems that there are always 50 partitions generated while there should be
around 1000.
SELECT * FROM spark_poc.table_name DISTRIBUTE BY
Hi to all,
I'm trying to convert my old MapReduce job to a Spark one but I have some
doubts..
My application basically buffers a batch of updates and every 100 elements
it flushes the batch to a server. This is very easy in MapReduce but I
don't know how you can do that in Scala..
For example, if
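A minimal sketch of that buffer-and-flush pattern in Spark (sendToServer and the
element type Update are hypothetical placeholders, assuming an RDD[Update] named
updates):

updates.foreachPartition { iter =>
  iter.grouped(100).foreach { batch =>  // buffer up to 100 elements at a time
    sendToServer(batch)                 // flush each batch to the server
  }
}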
Yeah, sorry for being unclear. Subquery expressions are not supported.
That particular error was coming from the Hive parser.
On Mon, Oct 27, 2014 at 4:03 PM, Daniel Klinger d...@web-computing.de wrote:
So it dosen't matter which dialect im using? Caus i set spark.sql.dialect
to
sql.
--
I'm running spark locally on my laptop to explore how persistence impacts
memory use. I'm generating 80 MB matrices in numpy and then simply adding
them as an example problem.
No matter what I set NUM or persistence level to in the code below, I get
out of memory errors like (
Hi,
How could I combine rdds? I would like to combine two RDDs if the count in
an RDD is not above some threshold.
Thanks,
Josh
Hi, I was wondering how to implement fixed sized strings in Spark SQL. I
would like to implement TPC-H, which uses fixed sized strings for certain
fields (i.e., 15 character L_SHIPMODE field).
Is there a way to use a fixed length char array instead of using a string?
Hey lordjoe,
Apologies for the late reply.
I followed your threadlocal approach and it worked fine. I will update the
thread if I get to know more on this.
(Don't know how Spark Scala does it, but what I wanted to achieve in Java is
quite common in many Spark/Scala GitHub gists.)
Thanks.
On
This requires evaluation of the RDD to do the count.
val x: RDD[X] = ...
val y: RDD[X] = ...
x.cache
val z = if (x.count <= thres) x.union(y) else x
On Oct 27, 2014 7:51 PM, Josh J joshjd...@gmail.com wrote:
Hi,
How could I combine rdds? I would like to combine two RDDs if the count in
an RDD is
guoxu1231, I struggled with the Idea problem for a full week. Same thing --
clean builds under MVN/Sbt, but no luck with IDEA. What worked for me was
the solution posted higher up in this thread -- it's a SO post that
basically says to delete all iml files anywhere under the project
directory.
You need to change the Scala compiler from IntelliJ to “sbt incremental
compiler” (see the screenshot below).
You can access this by going to “Preferences” > “Scala”.
NOTE: This is supported only for certain versions of the IntelliJ Scala plugin.
See this link for details.
This does look like it provides a good way to allow other processes to access the
contents of an RDD in a separate app. Is there any other general-purpose
mechanism for serving up RDD data? I understand that the driver app and workers
are all app-specific and run in separate executors, but would
You can change spark.shuffle.safetyFraction, but that is a really big
margin to add.
The problem is that the estimated size of used memory is inaccurate. I dug
into the code and found that SizeEstimator.visitArray randomly selects 100
cells and uses them to estimate the memory size of the whole
Here is the error log, abstracted as follows:
INFO [binaryTest---main]: before first
WARN [org.apache.spark.scheduler.TaskSetManager---Result resolver
thread-0]: Lost task 0.0 in stage 0.0 (TID 0, spark-dev136):
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null