You can use a custom partitioner to redistribute the data using partitionBy.
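A minimal RDD-level sketch of that (the pair RDD name and partition count are assumptions; partitionBy is an RDD method, not a DataFrame one):
import org.apache.spark.Partitioner
// Illustrative custom partitioner keyed on a Long item id.
class ItemIdPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val mod = key.asInstanceOf[Long].hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}
// pairRdd: RDD[(Long, V)] keyed by the column you will later join on.
val redistributed = pairRdd.partitionBy(new ItemIdPartitioner(1200))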
On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:
I'm currently trying to join two large tables (on the order of 1B rows each) using
Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
a halt.
I'm
Could you be more specific about how this is done?
The DataFrame class doesn't have that method.
On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote:
You can use a custom partitioner to redistribute the data using partitionBy.
On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:
Exception in thread main java.lang.RuntimeException:
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot
communicate with client version 4
I am not using any Hadoop facility (not even HDFS), so why is it giving
this error?
--
Thanks & Regards,
Anshu Shukla
Hi,
I have two small RDDs, each with about 600 records. In my code, I did
val rdd1 = sc...cache()
val rdd2 = sc...cache()
val result = rdd1.cartesian(rdd2).repartition(num_cpu).map { case (a, b) =>
  some_expensive_job(a, b)
}
I ran my job on a YARN cluster with --master yarn-cluster. I have 6
Hi!
I am new to Spark and I want to start with a simple word count example
in Java, but I want to read my input from a MongoDB database. I would like
to learn how to connect MongoDB to my project. Can anyone help with this
issue?
Have a nice day
yasemin
--
hiç ender hiç
Can you paste the complete stack trace? It looks like you have a version
incompatibility with Hadoop.
Thanks
Best Regards
On Sat, May 2, 2015 at 4:36 PM, drarse drarse.a...@gmail.com wrote:
When I run my program with spark-submit everything is OK. But when I try to
run it in standalone mode I
Hi,
Is there any way to force Spark to partition cached data across all
worker nodes, so that the data is not cached on only one of the worker nodes?
best,
/Shahab
With fileStream you can pass a filter parameter to avoid loading
.tmp files/directories.
Also, when you move/rename a file, the file creation date doesn't change,
so I believe Spark won't detect it.
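A minimal sketch of that filter (the directory, the .tmp suffix check, and the existing StreamingContext ssc are assumptions):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/data/incoming",
  (path: Path) => !path.getName.endsWith(".tmp"),  // skip temporary files/dirs
  newFilesOnly = true
).map(_._2.toString)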
Thanks
Best Regards
On Sat, May 2, 2015 at 9:37 PM, Evo Eftimov
Looks like a version incompatibility; just make sure you have the proper
version of Spark. Also look further in the stack trace for what is causing
"Futures timed out" (it could also be a network issue if the ports aren't
opened properly).
Thanks
Best Regards
On Sat, May 2, 2015 at 12:04 AM,
500 GB of data will have nearly 3,900 partitions, and if you can have nearly
that many cores and around 500 GB of memory then things will be
lightning fast. :)
Thanks
Best Regards
On Sun, May 3, 2015 at 12:49 PM, sherine ahmed sherine.sha...@hotmail.com
wrote:
I need to use spark to
Hello Dean & others,
Thanks for the response.
I tried with 100, 200, 400, 600, and 1200 repartitions with 100, 200, 400, and
800 executors. Each time, all the tasks of the join complete in less than a
minute except one, and that one task runs forever. I have a huge cluster at
my disposal.
The data for each
IMHO, if your data or your algorithm is prone to data skew, you have
to fix this at the application level; Spark itself cannot overcome this
problem (if one key has a large number of values). You may change your
algorithm to choose another shuffle key, something like that, to avoid
shuffling on
Hi
How do you figure out 500 GB ≈ 3,900 partitions? I am trying to do the math.
If I assume a 64 MB block size then 1 GB ≈ 16 blocks and 500 GB ≈ 8,000 blocks. If we
assume split and block sizes are the same, shouldn't we end up with 8k
partitions?
On 4 May 2015 17:49, Akhil Das ak...@sigmoidanalytics.com wrote:
Hello Shao,
Can you talk more about the shuffle key, or point me to APIs that allow me to
change the shuffle key? I will try different keys and see the performance.
What is the shuffle key by default?
On Mon, May 4, 2015 at 2:37 PM, Saisai Shao sai.sai.s...@gmail.com wrote:
IMHO If your data or
Assume your block size is 128MB.
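(That is where the figure comes from: 500 GB / 128 MB ≈ 3,900 blocks, hence roughly 3,900 input partitions, whereas the 64 MB block size you assumed gives the ~8,000 you calculated.)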
Thanks
Best Regards
On Mon, May 4, 2015 at 2:38 PM, ayan guha guha.a...@gmail.com wrote:
Hi
How do you figure out 500 GB ≈ 3,900 partitions? I am trying to do the math.
If I assume a 64 MB block size then 1 GB ≈ 16 blocks and 500 GB ≈ 8,000 blocks. If we
assume split and
The shuffle key depends on your implementation. If you are
familiar with MapReduce, the mapper output is a key-value pair, where the
key is the shuffle key used for shuffling; Spark is the same.
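For illustration (the record type and its fields are assumptions), the shuffle key is simply the first element of the pair you emit before the shuffle:
// Shuffle key = itemId: all values for one itemId land in the same partition.
val byItem = events.map(e => (e.itemId, e))
// A composite ("salted") key spreads a hot itemId across several partitions.
val salted = events.map(e => ((e.itemId, scala.util.Random.nextInt(10)), e))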
2015-05-04 17:31 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
Hello Shao,
Can you
This is how I implemented a map-side join using broadcast.
val listings = DataUtil.getDwLstgItem(sc,
DateUtil.addDaysToDate(startDate, -89))
val viEvents = details.map { vi => (vi.get(14).asInstanceOf[Long], vi) }
val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong,
lstg)
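A compact sketch of the same broadcast map-side join idea (the record types and field accessors are assumptions carried over from the snippet above, not the original code):
// Broadcast the smaller side as a lookup map, then join on the larger side
// without shuffling it.
val lstgItemMap = listings
  .map(lstg => (lstg.getItemId().toLong, lstg))
  .collectAsMap()
val lstgItemMapBC = sc.broadcast(lstgItemMap)
val joined = viEvents.flatMap { case (itemId, vi) =>
  lstgItemMapBC.value.get(itemId).map(lstg => (itemId, (vi, lstg)))
}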
One dataset (RDD Pair)
val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) }
Second Dataset (RDDPair)
val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) }
As I want to join based on the item ID that is used as the first element of the
tuple in both cases, and I
Four tasks are now failing with
(From the Spark UI task table: Index 0, ID 3771, Attempt 0, Status FAILED,
Locality Level PROCESS_LOCAL, Executor ID / Host 114 / host1,
Launch Time 2015/05/04 01:27:44, Error: ExecutorLostFailure (executor 114 lost).)
I got it working. When I was persisting, it only persisted 85% of the RDD to
memory and the rest of the RDD was recomputed every time. Because my
flagged RDD uses a random method to create the field, I was getting
unpredictable results. When I persist using:
Hi,
I'd like to use a JavaRDD containing parameters for an SQL query, and use
Spark SQL JDBC to load data from MySQL.
Consider the following pseudo code:
JavaRDD<String> namesRdd = ... ;
...
options.put("url", "jdbc:mysql://mysql?user=usr");
options.put("password", "pass");
options.put("dbtable", "(SELECT *
From the symptoms you mentioned, that one task's shuffle write is much
larger than all the other tasks', this looks quite similar to typical data skew
behavior. I am just giving some advice based on your description; I think you
need to detect whether the data is actually skewed or not.
The shuffle will put data
Hi,
What is your Spark version?
Regards,
Olivier.
On Mon, 4 May 2015 at 11:03, luohui20...@sina.com wrote:
hi guys
when I am running a SQL query like select a.name, a.startpoint, a.endpoint,
a.piece from db a join sample b on (a.name = b.name) where (b.startpoint
a.startpoint + 25); I
I ran it against one file instead of 10 files and I see one task is still
running after 33 minutes; its shuffle read size is 780 MB / 50 million records.
I did a count of records for each itemId from dataset-2 [one file] (Second
Dataset (RDD pair): val viEvents = viEventsRaw.map { vi => (vi.get(14
Why do you use a custom partitioner?
Are you doing a join?
And can you share some code that shows how you implemented the custom
partitioner?
On Tue, Apr 21, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote:
Are you looking for this?
mapPartitions(func): similar to map, but runs separately on
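A minimal sketch of mapPartitions (the data and the per-partition work are assumptions, for illustration only):
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val partialSums = rdd.mapPartitions { iter =>
  // Do any per-partition setup here (e.g. open a connection), then
  Iterator(iter.sum)  // emit one result per partition
}
partialSums.collect()  // one partial sum per partition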
Hello Soila,
Can you share the code that shows the usage of RangePartitioner?
I am facing an issue with .join() where one task runs forever. I tried
repartition(100/200/300/1200) and it did not help. I cannot use a map-side
join because both datasets are huge and beyond the driver memory size.
Regards,
Deepak
Hi,
I am trying to build an example code given at
https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
code is:
// Import factory methods provided by DataType.
import org.apache.spark.sql.types.DataType;
// Import StructType and StructField
import
Datasets
val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) }
val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) }
What is the difference between
1)
lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(partitions =
1200, rdd = viEvents)).map
Hi all,
I am trying to understand the output of the SVM classifier.
Right now, my output looks like this:
-18.841544889249917 0.0
168.32916035523283 1.0
420.67763915879794 1.0
-974.1942589201286 0.0
71.73602841256813 1.0
233.13636224524993 1.0
-1000.5902168199027 0.0
The
Hi Saurabh,
Did you check the Maven log?
2015-05-04 15:17 GMT+02:00 Saurabh Gupta saurabh.gu...@semusi.com:
Hi,
I am trying to build an example code given at
https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds
code is:
// Import factory methods
I think I figured it out. I am playing around with the Cassandra connector
and I had a method that inserted some data into a locally-running Cassandra
instance, but I forgot to close the Cluster object. I guess that left some
non-daemon thread running and kept the process from exiting. Nothing
Hi Robert,
I would say that the sign of the numbers represents the class of the
input vector. What kind of data are you using, and what kind of training set
do you use? Fundamentally an SVM is able to separate only two classes; you
can do one-vs-the-rest as you mentioned.
I don't see how LVQ can
I'm trying to deal with some code that runs differently in Spark
stand-alone mode and when Spark is running on a cluster. Basically, for each item
in an RDD, I'm trying to add it to a list, and once this is done, I want to
send this list to Solr.
This works perfectly fine when I run the following code in
I am really new to this, but what should I look for in the Maven logs?
I have tried mvn package -X -e.
Should I show the full trace?
On Mon, May 4, 2015 at 6:54 PM, Driesprong, Fokko fo...@driesprong.frl
wrote:
Hi Saurabh,
Did you check the log of maven?
2015-05-04 15:17 GMT+02:00 Saurabh Gupta
I'm not familiar with the Solr API, but provided that 'SolrIndexerDriver'
is a singleton, I guess that what's going on when running on a cluster is
that the call to:
SolrIndexerDriver.solrInputDocumentList.add(elem)
is happening on different singleton instances of SolrIndexerDriver on
You can use applySchema
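A minimal Scala sketch of that (the schema, table names, and connection options are assumptions mirroring the pseudo code below, not a tested recipe):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}
val rowRdd = namesRdd.map(name => Row(name))
val schema = StructType(Seq(StructField("name", StringType, nullable = false)))
sqlContext.applySchema(rowRdd, schema).registerTempTable("names")
val mysqlDf = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://mysql?user=usr&password=pass",
  "dbtable" -> "people"))
mysqlDf.registerTempTable("people")
val result = sqlContext.sql(
  "SELECT p.* FROM people p JOIN names n ON p.name = n.name")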
On Mon, May 4, 2015 at 10:16 PM, Lior Chaga lio...@taboola.com wrote:
Hi,
I'd like to use a JavaRDD containing parameters for an SQL query, and use
Spark SQL JDBC to load data from MySQL.
Consider the following pseudo code:
JavaRDD<String> namesRdd = ... ;
...
I have the following simple example program:
public class SimpleCount {
public static void main(String[] args) {
final String master = System.getProperty("spark.master",
    "local[*]");
System.out.printf("Running job against spark master %s ...%n",
    master);
final SparkConf
Hi,
I am getting a "No space left on device" exception when doing repartitioning
of approx. 285 MB of data while there is still 2 GB of space left??
Does it mean that repartitioning needs more space (more than 2 GB) for
repartitioning of 285 MB of data??
best,
/Shahab
java.io.IOException: No
Are you using only 1 executor?
On Mon, May 4, 2015 at 11:07 PM, luohui20...@sina.com wrote:
hi Olivier,
Spark 1.3.1, with Java 1.8.0_45,
and I attached 2 pics.
It seems like a GC issue. I also tried different parameters like
memory size of driver/executor, memory fraction, Java opts...
but
Hi,
I think you need to import org.apache.spark.sql.types.DataTypes instead
of org.apache.spark.sql.types.DataType, and use that to access
StringType.
HTH,
Deng
On Mon, May 4, 2015 at 9:37 PM, Saurabh Gupta saurabh.gu...@semusi.com
wrote:
I am really new to this but what should I
Hi Yasemin,
You can find a MongoDB connector for Spark SQL here:
http://github.com/Stratio/spark-mongodb
Best regards
2015-05-04 9:27 GMT+02:00 Yasemin Kaya godo...@gmail.com:
Hi!
I am new to Spark and I want to start with a simple word count example
in Java, but I want to read my input
See
https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available
What's the value of the spark.local.dir property?
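If that turns out to be the problem, a sketch of pointing it somewhere roomier (the path is hypothetical):
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.local.dir", "/mnt/bigdisk/spark-tmp")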
Cheers
On Mon, May 4, 2015 at 6:57 AM, shahab shahab.mok...@gmail.com wrote:
Hi,
I am getting No space left on device exception
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD
The arguments are sql string, lower bound, upper bound, number of
partitions.
Your call SELECT * FROM MEMBERS LIMIT ? OFFSET ?, 0, 100, 1
would thus be run as
SELECT * FROM MEMBERS LIMIT 0 OFFSET 100
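A minimal JdbcRDD sketch (the connection string, table, and columns are assumptions; the two '?' placeholders receive the per-partition lower/upper bounds):
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD
val members = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass"),
  "SELECT * FROM members WHERE id >= ? AND id <= ?",
  lowerBound = 1,
  upperBound = 1000000,
  numPartitions = 10,
  mapRow = rs => (rs.getLong("id"), rs.getString("name")))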
Hi,
What kind of "can't find symbol" error are you receiving?
On the other hand, I would try changing the Guava dependency version to 14.0.1.
In Spark 1.3.0 the Guava version is 14.0.1, but it is not included inside the Spark
artifact because it's marked as provided.
Hi, I'm trying to parse log files generated by Spark using Spark SQL.
In the JSON elements related to the StageCompleted event we have a nested
structure containing an array of elements with RDD info (see the log below
as an example, omitting some parts):
{
  "Event": "SparkListenerStageCompleted",
I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and
improve query performance, but I found it took as long as querying
without LIMIT. This confused me. Does anybody know why?
Thanks.
Regards, Yi
The structure seems fine. You only need to add at the end of your program:
ssc.start();
ssc.awaitTermination();
Check the method arguments. I advise you to check the Spark Java streaming API:
https://spark.apache.org/docs/1.3.0/api/java/
Regards.
Regarding the large GC pauses: assuming you allocated all 100 GB of
memory per worker, you may consider running with less memory on your worker
nodes, or splitting up the available memory on the worker nodes amongst
several worker instances. The JVM's garbage collection starts to become
very
What query are you running? It may be the case that your query requires
PostgreSQL to do a large amount of work before identifying the first n rows.
On 4 May 2015, at 15:52, Yi Zhang zhangy...@yahoo.com.INVALID wrote:
I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and
And a further question - have you tried running this query in psql? What's the
performance like there?
On 4 May 2015, at 16:04, Robin East robin.e...@xense.co.uk wrote:
What query are you running? It may be the case that your query requires
PostgreSQL to do a large amount of work before
Hi
Can someone share some working code for a custom partitioner in Python?
I am trying to understand it better.
Here is documentation
partitionBy(numPartitions, partitionFunc=<function portable_hash>)
I have implemented a map-side join with broadcast variables and the code is
on the mailing list (Scala).
On Mon, May 4, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote:
Hi
Can someone share some working code for custom partitioner in python?
I am trying to understand it better.
Here is
I tried this
val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))]
= lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(partitions
= 1200, rdd = viEvents)).map {
It fired two jobs and still I have 1 task that never completes.
Full disclosure, I am *brand* new to Spark.
I am trying to use [Py]SparkSQL standalone to pre-process a bunch of *local*
(non HDFS) Parquet files. I have several thousand files and want to dispatch
as many workers as my machine can handle to process the data in parallel;
either at the per-file
Hi,
I am trying create a DashBoard of a job of Apache Spark. I need run Spark
Streaming 24/7 and when recive a ajax request this answer with the actual
state of the job. I have created the client, and the program in Spark. I
tried create the service of response with play, but this run the
Can I specify this in my build file ?
Thanks.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/com-datastax-spark-spark-streaming-2-10-1-1-0-in-my-build-sbt-tp22758.html
I'm submitting this via 'dse spark-submit' but somehow I don't see any
logging in my cluster or worker machines...
How can I find out?
My cluster is running DSE 4.6.1 with Spark enabled.
My source is running Kafka 0.8.2.0
I'm launching my program on one of my DSE machines.
Any insights much
Hi, I am training a GMM with 10 Gaussians on a 4 GB dataset (720,000 × 760).
The Spark (1.3.1) job is allocated 120 executors with 6 GB each, and the
driver also has 6 GB.
Spark Config Params:
.set("spark.hadoop.validateOutputSpecs", "false")
.set("spark.dynamicAllocation.enabled",
hi all,
when I submit a Spark SQL program to select data from my Hive
database I get an error like this:
User class threw exception: java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my
Spark configuration? Thank
You can use:
EXPLAIN EXTENDED SELECT ….
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 05, 2015 9:52 AM
To: Cheng, Hao; Olivier Girardot; user
Subject: Re: RE: Re: Re: sparksql running slow while joining_2_tables.
As I know broadcastjoin is automatically enabled by
I see a Jira for this one, but unresolved.
https://issues.apache.org/jira/browse/SPARK-1517
On May 4, 2015, at 10:25 PM, Ankur Chauhan achau...@brightcove.com wrote:
Hi,
Does anyone know if spark has any nightly builds or equivalent
Can you print out the physical plan?
EXPLAIN SELECT xxx…
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining 2 tables.
hi Olivier,
Spark 1.3.1, with Java 1.8.0_45,
and I attached 2 pics
Actually, after some digging, I did find a JIRA for it: SPARK-5470.
The fix for this has gone into master, but it isn't in 1.2.
On Mon, May 4, 2015 at 2:47 PM, Imran Rashid iras...@cloudera.com wrote:
Oh, this seems like a real pain. You should file a JIRA; I didn't see an
open issue -- if
guys, just to confirm, does Spark SQL support the Hive view feature? Is that the
LateralView in the Hive language manual?
thanks
Thanks & Best regards!
罗辉 San.Luo
See this related thread:
http://search-hadoop.com/m/JW1q5bnnyT1
Cheers
On Mon, May 4, 2015 at 7:58 PM, Guru Medasani gdm...@gmail.com wrote:
I see a Jira for this one, but unresolved.
https://issues.apache.org/jira/browse/SPARK-1517
On May 4, 2015, at 10:25 PM, Ankur Chauhan
Hi,
There is also a make-distribution.sh file in the repository root. If someone
with Jenkins access could create a simple builder, that would be awesome.
But I am guessing that besides the Spark binary one would probably also want the
Maven artifacts (lower priority though) to work with it.
-- Ankur
I am trying to distribute a table using a particular column which is the key
that I'll be using to perform join operations on the table. Is it possible to
do this with Spark SQL?
I checked the method partitionBy() for RDDs, but I'm not sure how to specify which
column is the key. Can anyone give an
We support both LATERAL VIEWs (a query language feature that lets you turn
a single row into many rows, for example with an explode) and virtual views
(a table that is really just a query that is run on demand).
On Mon, May 4, 2015 at 7:12 PM, luohui20...@sina.com wrote:
guys,
just to
If you do a join with at least one equality relationship between the two
tables, Spark SQL will automatically hash partition the data and perform
the join.
If you are looking to prepartition the data, that information is not yet
propagated from the in-memory cached representation, so it won't help
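For illustration only (table and column names are assumptions), an equality join such as this one is hash partitioned on the join key automatically:
val joined = sqlContext.sql(
  "SELECT a.*, b.qty FROM orders a JOIN inventory b ON a.item_id = b.item_id")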
Robin, my query statement is as below:
select id, name, trans_date, gender, hobby, job, country from Employees LIMIT 100
In PostgreSQL, it works very well. For 10M records in the DB, it took less
than 20 ms, but in Spark SQL it took a long time.
Michael,
Got it. For me, it is not good news. Anyway,
Data Set 1: viEvents is the event activity data of 1 day. I took 10
files out of it and 10 records:
Item ID         Count
201335783004    3419
191568402102    1793
111654479898    1362
181503913062    1310
261798565828    1028
111654493548     994
231516683056     862
I assume you’re using the DataFrame API within your application.
sql(“SELECT…”).explain(true)
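A small sketch of what that prints (assuming the DataFrame API; the query itself is a placeholder):
val df = sqlContext.sql("SELECT ...")  // the join query in question
df.explain(true)  // prints the parsed, analyzed, optimized and physical plans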
From: Wang, Daoyuan
Sent: Tuesday, May 5, 2015 10:16 AM
To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user
Subject: RE: Re: RE: Re: Re: sparksql running slow while joining_2_tables.
You can use
Starting the master with /sbin/start-master.sh creates a JVM with only
512 MB of memory. How do I change this default amount of memory?
Thanks,
Vijay
bq. its Spark libs are all at 2.10
Clarification: 2.10 is the version of Scala.
Your Spark version is 1.1.0.
You can use an earlier release of Kafka.
Cheers
On Mon, May 4, 2015 at 2:39 PM, Eric Ho eric...@intel.com wrote:
I still prefer to use Spark core / streaming /... at 2.10 because my DSE
is at
Shoot me an email if you need any help with spark-sorted. It does not
(yet?) have a Java API, so you will have to work in Scala.
On Mon, May 4, 2015 at 4:05 PM, Burak Yavuz brk...@gmail.com wrote:
I think this Spark Package may be what you're looking for!
Thanks, but is there a non-broadcast solution?
On 5 May 2015 01:34, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have implemented map-side join with broadcast variables and the code is
on mailing list (scala).
On Mon, May 4, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote:
Hi
Can
The JDBC interface for Spark SQL does not support pushing down limits today.
On Mon, May 4, 2015 at 8:06 AM, Robin East robin.e...@xense.co.uk wrote:
and a further question - have you tried running this query in pqsl? what’s
the performance like there?
On 4 May 2015, at 16:04, Robin East
If your data is evenly distributed (i.e. no skewed data points in your join
keys), it can also help to increase spark.sql.shuffle.partitions (default
is 200).
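A one-line sketch of bumping it (the value 800 is an arbitrary example):
sqlContext.setConf("spark.sql.shuffle.partitions", "800")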
On Mon, May 4, 2015 at 8:03 AM, Richard Marscher rmarsc...@localytics.com
wrote:
In regards to the large GC pauses, assuming you allocated
Currently, SVMs don't have built-in multiclass support. Logistic
Regression supports multiclass, as do trees and random forests. It would
be great to add multiclass support for SVMs as well.
There is some ongoing work on generic multiclass-to-binary reductions:
Did you confirm through the Spark UI how much memory is getting allocated to
your application on each worker?
Mohammed
From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]
Sent: Monday, May 4, 2015 3:36 PM
To: Andrew Ash
Cc: user@spark.apache.org
Subject: Re: Spark JVM default memory
I am trying
You are looking for LATERAL VIEW explode
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
in HiveQL.
On Mon, May 4, 2015 at 7:49 AM, Giovanni Paolo Gibilisco gibb...@gmail.com
wrote:
Hi, I'm trying to parse log files generated by Spark using
I am not able to access the web UI for some reason. But the logs (being
written while running my application) show that only 385 MB are being
allocated for each executor (or slave node) while the executor memory I
set is 16 GB. This 385 MB is not the same for each run either. It looks
random
I am trying to read in a file (4GB file). I tried setting both
spark.driver.memory and spark.executor.memory to large values (say
16GB) but I still get a GC limit exceeded error. Any idea what I am missing?
On Mon, May 4, 2015 at 5:30 PM, Andrew Ash and...@andrewash.com wrote:
It's unlikely you
Oh wow, that is a really interesting observation, Marco & Jerry.
I wonder if this is worth exposing in combineByKey()? I think Jerry's
proposed workaround is all you can do for now -- use reflection to
side-step the fact that the methods you need are private.
On Mon, Apr 27, 2015 at 8:07 AM,
yes, you should register all three.
The truth is, you only *need* to register classes that will get serialized
-- either via RDD caching or in a shuffle. So if a type is only used as an
intermediate inside a stage, you don't need to register it. But the
overhead of registering extra classes is
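A minimal registration sketch (the application classes here are hypothetical stand-ins for the three being discussed):
import org.apache.spark.SparkConf
case class MyKey(id: Long)
case class MyRecord(key: MyKey, payload: String)
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Register whatever will actually be cached or shuffled.
  .registerKryoClasses(Array(classOf[MyKey], classOf[MyRecord]))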
Looks like you're using Spark 1.1.0
Support for Kafka 0.8.2 was added by:
https://issues.apache.org/jira/browse/SPARK-2808
which would come in Spark 1.4.0
FYI
On Mon, May 4, 2015 at 12:22 PM, Eric Ho eric...@intel.com wrote:
I'm submitting this via 'dse spark-submit' but somehow, I don't see
It's unlikely you need to increase the amount of memory on your master node
since it does simple bookkeeping. The majority of the memory pressure
across a cluster is on executor nodes.
See the conf/spark-env.sh file for configuring heap sizes, and this section
in the docs for more information on
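(For the standalone daemons specifically, SPARK_DAEMON_MEMORY in conf/spark-env.sh is the setting that raises the master's and workers' own heaps above the 512 MB default, if that is really what is needed here.)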
Hi Marius,
I am also a little confused -- are you saying that myPartitions is
basically something like:
class MyPartitioner extends Partitioner {
def numPartitions = 1
def getPartition(key: Any) = 0
}
??
If so, I don't understand how you'd ever end up with data in two partitions.
Indeed, than
sortByKey() runs one job to sample the data, to determine what range of
keys to put in each partition.
There is a jira to change it to defer launching the job until the
subsequent action, but it will still execute another stage:
https://issues.apache.org/jira/browse/SPARK-1021
On Wed, Apr 29,
I think this Spark Package may be what you're looking for!
http://spark-packages.org/package/tresata/spark-sorted
Best,
Burak
On Mon, May 4, 2015 at 12:56 PM, Imran Rashid iras...@cloudera.com wrote:
oh wow, that is a really interesting observation, Marco Jerry.
I wonder if this is worth
Hi,
I'm trying to build the DAG of an application from the logs.
I've had a look at SparkReplayDebugger but it doesn't operate offline on
logs. I also looked at the one in this pull request:
https://github.com/apache/spark/pull/2077 that seems to operate only on
logs, but it doesn't clearly show the
Hi Sergio,
you shouldn't architect it this way; rather, update a store with Spark
Streaming that your Play app will query.
For example a Cassandra table, or Redis, or anything that will be able to
answer you in milliseconds, rather than querying the Spark Streaming
program directly.
Regards,
Olivier.
Or, have you ever tried a broadcast join?
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20...@sina.com; Olivier Girardot; user
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Can you print out the physical plan?
EXPLAIN SELECT xxx…
As I know, broadcast join is automatically enabled by
spark.sql.autoBroadcastJoinThreshold; refer to
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
And how do I check my app's physical plan, and other things like the optimized
plan, executable plan, etc.?
thanks
you may need to copy hive-site.xml to your Spark conf directory and check your
Hive metastore warehouse setting, and also check whether you are authenticated to
access the Hive metastore warehouse.
Thanks & Best regards!
罗辉 San.Luo
----- Original Message -----
From: 鹰