Spark SQL's optimizations come primarily from column pruning and predicate pushdown. Here you are taking advantage of neither.
I am curious what goes into your filter function, as you are not using a filter on the SQL side.
Best
Ayan
On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo
I think the recommended approach is to create a DataFrame using HBase as the source.
Then you can run any SQL on that DF.
In 1.2 you can create a base RDD and then apply a schema in the same manner.
On 21 Apr 2015 03:12, Jeetendra Gangele gangele...@gmail.com wrote:
Thanks for reply.
Does phoenix using
You can always create another DF using a map. Operations are lazy, so only the
final value will get computed.
Can you provide the use case in a little more detail?
On 21 Apr 2015 08:39, ARose ashley.r...@telarix.com wrote:
In my Java application, I want to update the values of a Column in a
If you are using a pair RDD, you can use the partitionBy method to provide
your partitioner.
On 21 Apr 2015 15:04, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
What is re-partition ?
On Tue, Apr 21, 2015 at 10:23 AM, ayan guha guha.a...@gmail.com wrote:
In my understanding, you need to create a key out of the data and
repartition both datasets to achieve a map-side join.
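As a plain-Python sketch of what a broadcast (map-side) join does (the keys and values are invented for illustration; in Spark the lookup dict would be wrapped in `sc.broadcast(...)` and used inside `map`):

```python
# Sketch of a map-side (broadcast) join in plain Python, illustrating the idea
# of broadcasting the smaller dataset instead of shuffling both sides.
small = [("u1", "Alice"), ("u2", "Bob")]          # small dataset: (key, name)
large = [("u1", 100), ("u2", 250), ("u1", 75)]    # large dataset: (key, amount)

small_lookup = dict(small)  # this is what would be broadcast to every executor

# The "map side" of the join: each record of the large dataset is joined
# locally against the broadcast lookup table, with no shuffle of the large side.
joined = [(k, small_lookup[k], v) for k, v in large if k in small_lookup]

print(joined)  # [('u1', 'Alice', 100), ('u2', 'Bob', 250), ('u1', 'Alice', 75)]
```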
On 21 Apr 2015 14:10, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Can someone share their working code of Map Side join in Spark + Scala.
(No Spark-SQL)
The only resource i could
You can use rdd.unpersist(). It's documented in the Spark programming guide
under the Removing Data section.
Ayan
On 21 Apr 2015 13:16, Wei Wei vivie...@gmail.com wrote:
Hey folks,
I am trying to load a directory of avro files like this in spark-shell:
val data =
-hadoop2.6\python\pyspark\mllib\recommendation.py,
line 127, in _prepare
assert isinstance(ratings, RDD), ratings should be RDD
AssertionError: ratings should be RDD
--
Best Regards,
Ayan Guha
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
/Column-renaming-after-DataFrame-groupBy-tp22586.html
Sent from the Apache Spark User List mailing list archive
http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
solution? I am thinking of mapping the training dataframe back to an RDD,
but will lose the schema information.
Best
Ayan
On Mon, Apr 20, 2015 at 10:23 PM, ayan guha guha.a...@gmail.com wrote:
Hi
Just upgraded to Spark 1.3.1.
I am getting a warning
Warning (from warnings module):
File
D
What is the specific use case? I can think of a couple of ways (write to HDFS
and then read from Spark, or stream data to Spark). I have also seen people
using MySQL jars to bring data in. Essentially you want to simulate the
creation of an RDD.
On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com
)
print newsY.count()
On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:
Hi
I am facing this weird issue.
I am on Windows, and I am trying to load all files within a folder. Here
is my code -
loc = D:\\Project\\Spark\\code\\news\\jsonfeeds
newsY = sc.textFile(loc
that are currently available using API calls and
then take some appropriate action based on the information I get back, like
restart a dead Master or Worker.
Is this possible? Does Spark provide such an API?
On Sun, Apr 26, 2015 at 10:12 AM, ayan guha guha.a...@gmail.com wrote:
In my limited understanding, there must be a single leader master in
the cluster. If there are multiple leaders, it will lead to an unstable
cluster, as each master will keep scheduling independently. You should use
ZooKeeper.
#org.apache.spark.ml.recommendation.ALS
In the examples/ directory for ml/, you can find a MovieLensALS example.
Good luck!
Joseph
On Tue, Apr 21, 2015 at 4:58 AM, ayan guha guha.a...@gmail.com wrote:
Hi
I am getting an error
Also, I am getting an error in the mllib.ALS.train function when passing
you!
Best,
Wenlei
I just tested your pr
On 25 Apr 2015 10:18, Ali Bajwa ali.ba...@gmail.com wrote:
Any ideas on this? Any sample code to join 2 data frames on two columns?
Thanks
Ali
On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote:
Hi experts,
Sorry if this is a n00b question or has
you so much for the help!
On Sat, Apr 25, 2015 at 12:41 AM, ayan guha guha.a...@gmail.com wrote:
can you give an example set of data and desired output
On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie wenlei@gmail.com wrote:
Hi,
I would like to answer the following customized aggregation
that this is different than the Spark SQL
JDBC server, which allows other applications to run queries using Spark
SQL).
On Fri, Apr 24, 2015 at 6:27 PM, ayan guha guha.a...@gmail.com wrote:
What is the specific usecase? I can think of couple of ways (write to hdfs
and then read from spark or stream
:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
I do not think you can share data across spark contexts. So as long as you
can pass it around you should be good.
On 23 Apr 2015 17:12, Suraj Shetiya surajshet...@gmail.com wrote:
Hi,
I have come across ways of building pipeline of input/transform and output
pipelines with Java (Google
Quick questions: why are you caching both the RDD and the table?
Which stage of the job is slow?
On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote:
Hi,
I have Spark SQL performance issue. My code contains a simple JavaBean:
public class Person implements Externalizable {
')
But in Spark 1.4.0 this does not seem to make any difference anyway and
the problem is the same with both versions.
On 2015-04-21 17:04, ayan guha wrote:
your code should be
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
Your current code is generating a tuple.
Hi, I replied to you on SO. If option A had an action call then it should
suffice too.
On 28 Apr 2015 05:30, Eran Medan eran.me...@gmail.com wrote:
Hi Everyone!
I'm trying to understand how Spark's cache work.
Here is my naive understanding, please let me know if I'm missing
something:
val
Can you show your code please?
On 28 Apr 2015 13:20, sranga sra...@gmail.com wrote:
Hi
I am getting the following error when persisting an RDD in parquet format
to
an S3 location. This is code that was working in version 1.2. The
version in which it is failing is 1.3.1.
Any help is
It's a Windows thing. Please escape the backslashes in the path string. Basically it is
not able to find the file.
On 28 Apr 2015 22:09, Fabian Böhnlein fabian.boehnl...@gmail.com wrote:
Can you specify what 'running via PyCharm' means? How are you executing the script,
with spark-submit?
In PySpark I guess you used
)
at java.lang.Thread.run(Thread.java:745)
Does filter work only on columns of the integer type? What is the exact
behaviour of the filter function and what is the best way to handle the
query I am trying to execute?
Thank you,
Francesco
I guess what you want is not streaming. If you create a streaming context at
time t, you will receive data arriving after time t, not before it.
It looks like you want a queue: let Kafka write to a queue, consume messages
from the queue, and stop when the queue is empty.
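A minimal sketch of that queue pattern in plain Python (the messages are stand-ins; in practice the producer side would be your Kafka consumer):

```python
import queue

# A producer (e.g. a Kafka consumer) fills a queue; the processing loop
# drains it and stops once the queue is empty, as suggested above.
q = queue.Queue()
for msg in ["a", "b", "c"]:   # stand-in for messages that arrived earlier
    q.put(msg)

processed = []
while True:
    try:
        processed.append(q.get_nowait())
    except queue.Empty:
        break   # stop when the queue is empty

print(processed)  # ['a', 'b', 'c']
```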
On 29 Apr 2015 14:35,
Is your driver running on the same machine as the master?
On 29 Apr 2015 03:59, Anshul Singhle ans...@betaglide.com wrote:
Hi,
I'm running short spark jobs on rdds cached in memory. I'm also using a
long running job context. I want to be able to complete my jobs (on the
cached rdd) in under 1 sec.
it using DataFrame? Can you
give an example code snippet?
Thanks
Ningjun
*From:* ayan guha [mailto:guha.a...@gmail.com]
*Sent:* Wednesday, April 29, 2015 5:54 PM
*To:* Wang, Ningjun (LNG-NPV)
*Cc:* user@spark.apache.org
*Subject:* Re: HOw can I merge multiple DataFrame and remove duplicated key
It's no different; you would use group by and an aggregate function to do so.
On 30 Apr 2015 02:15, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com
wrote:
I have multiple DataFrame objects each stored in a parquet file. The
DataFrame just contains 3 columns (id, value, timeStamp). I need to
The answer is: it depends :)
The fact that query runtime increases indicates more shuffling. You may want
to construct RDDs based on the keys you use.
You may want to specify what kind of nodes you are using and how many
executors. You may also want to play around with executor memory.
Spark keeps jobs in memory by default for the kind of performance gains you are
seeing. Additionally, depending on your query, Spark runs stages, and at any
point in time Spark's code behind the scenes may issue an explicit cache. If
you hit any such scenario you will find those cached objects in the UI under
Hi
Can you test on a smaller dataset to identify whether it is a cluster issue or
a scaling issue in Spark?
On 28 Apr 2015 11:30, Ulanov, Alexander alexander.ula...@hp.com wrote:
Hi,
I am running a group by on a dataset of 2B of RDD[Row [id, time, value]]
in Spark 1.3 as follows:
“select id,
The alias function is not in Python yet. I suggest writing SQL if your data suits
it.
On 28 Apr 2015 14:42, Don Drake dondr...@gmail.com wrote:
https://issues.apache.org/jira/browse/SPARK-7182
Can anyone suggest a workaround for the above issue?
Thanks.
-Don
--
Donald Drake
Drake Consulting
Yes, it is possible. You need to use the jsonFile method on the SQLContext and then
create a DataFrame from the RDD. Then register it as a table. It should be 3
lines of code, thanks to Spark.
You may want to watch a few YouTube videos, especially on unifying pipelines.
On 3 May 2015 19:02, Jai jai4l...@gmail.com wrote:
Hi,
You can use a custom partitioner to redistribute the data using partitionBy.
On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:
I'm currently trying to join two large tables (order 1B rows each) using
Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
a halt.
I'm
Hi
How do you figure 500 GB ~ 3900 partitions? I am trying to do the math.
If I assume a 64 MB block size then 1 GB ~ 16 blocks and 500 GB ~ 8000 blocks. If we
assume split and block sizes are the same, shouldn't we end up with 8k
partitions?
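For what it's worth, the arithmetic can be checked directly; the ~3900 figure matches a 128 MB block size rather than 64 MB (a sketch, assuming split size equals block size):

```python
# Checking the partition arithmetic in the question. Whether you land near
# 8000 or near 3900 partitions depends on the assumed HDFS block size.
GB = 1024  # MB per GB

blocks_64mb = 500 * GB // 64    # matches the "500 GB ~ 8000 blocks" estimate
blocks_128mb = 500 * GB // 128  # close to the ~3900 partitions observed

print(blocks_64mb, blocks_128mb)  # 8000 4000
```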
On 4 May 2015 17:49, Akhil Das ak...@sigmoidanalytics.com wrote:
?
Thanks,
Lior
path?
b) How can I do partitionBy? Specifically, when I call DF.rdd.partitionBy,
what gets passed to the custom function? A tuple? A Row? How do I access, say, the 3rd
column of a tuple inside the partitioner function?
Do you have an RDD or a DataFrame? RDDs are essentially tuples. You can add a new
column to one with a map.
RDDs are immutable, so you will get another RDD.
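A plain-Python sketch of the map idea, with made-up tuples (in Spark this would be something like `rdd.map(lambda t: t + (new_value,))`, producing a new RDD):

```python
# Since RDDs are immutable, "adding a column" means mapping each tuple to a
# new, wider tuple, yielding a new collection (a new RDD in Spark).
rows = [(1, "a"), (2, "b")]                    # stand-in for the original RDD
with_extra = [t + (len(t[1]),) for t in rows]  # map: append a derived column
print(with_extra)  # [(1, 'a', 1), (2, 'b', 1)]
```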
On 1 May 2015 14:59, Carter gyz...@hotmail.com wrote:
Hi all,
I have a RDD with *MANY *columns (e.g., *hundreds*), how do I add one more
column at the
And if I may ask, how long does it take in the HBase CLI? I would not expect Spark
to improve the performance of HBase. At best Spark will push down the filter
to HBase. So I would try to optimise away any additional overhead, like bringing
data into Spark.
On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote:
PM ayan guha guha.a...@gmail.com wrote:
Looks like your DF is based on a MySQL DB using JDBC, and the error is thrown
from MySQL. Can you see what SQL is finally getting fired in MySQL? Spark
is pushing down the predicate to MySQL, so it's not a Spark problem per se.
On Wed, Apr 29, 2015 at 9:56 PM
This is my first thought; please suggest any further improvement:
1. Create an RDD of your dataset.
2. Do a cross join to generate pairs.
3. Apply reduceByKey and compute the distance. You will get an RDD with key pairs
and distances.
Best
Ayan
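The three steps above can be sketched in plain Python (the point names and coordinates are made-up illustration data; in Spark, step 2 would be `cartesian` and step 3 `reduceByKey`):

```python
from itertools import product

# Step 1: a stand-in dataset of keyed points.
points = {"p1": (0.0, 0.0), "p2": (3.0, 4.0), "p3": (0.0, 1.0)}

# Step 2: cross join to generate pairs (skip self-pairs and mirror duplicates).
pairs = [(a, b) for a, b in product(points, points) if a < b]

# Step 3: per key pair, compute the Euclidean distance (the reduce stand-in).
def dist(a, b):
    (x1, y1), (x2, y2) = points[a], points[b]
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

distances = {(a, b): dist(a, b) for a, b in pairs}
print(distances[("p1", "p2")])  # 5.0
```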
On 30 Apr 2015 06:11, Driesprong, Fokko fo...@driesprong.frl
(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
And how important is it to have a production environment?
On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:
There are questions in all three languages.
2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:
I too have a similar question.
My understanding is since Spark
Also, if not already done, you may want to try repartitioning your data into 50
partitions.
On 6 May 2015 05:56, Manu Kaul manohar.k...@gmail.com wrote:
Hi All,
For a job I am running on Spark with a dataset of say 350,000 lines (not
big), I am finding that even though my cluster has a large
it to a tuple2 seems like a waste of space/computation.
It looks like PairRDDFunctions.partitionBy() uses a ShuffledRDD[K,V,C], which
requires K, V, C? Could I create a new
ShuffledRDD[MyClass,MyClass,MyClass](caseClassRdd, new HashPartitioner)?
Cheers,
N
Every transformation on a DStream creates another DStream. You may want
to take a look at foreachRDD. Also, kindly share your code so people can
help better.
On 6 May 2015 17:54, anshu shukla anshushuk...@gmail.com wrote:
Please help, guys. Even after going through all the examples given I
for Spark certification; learning in a group makes learning easy and fun.
Kartik
On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote:
And how important is it to have a production environment?
On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:
There are questions in all three languages
.
Is the above understanding correct? or is there more to it?
be forced.
Any ideas?
What happens when you try to put files into your HDFS from the local filesystem?
Looks like it's an HDFS issue rather than a Spark thing.
On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:
I have searched all replies to this question and not found an answer.
I am running standalone Spark 1.3.1 and
From S3, as the dependency of the DF will be on S3, and because RDDs are not
replicated.
On 8 May 2015 23:02, Peter Rudenko petro.rude...@gmail.com wrote:
Hi, i have a next question:
val data = sc.textFile("s3:///")
val df = data.toDF
df.saveAsParquetFile("hdfs://")
df.someAction(...)
if during
Try this:
res = ssc.sql(your SQL without limit)
print(res.first())
Note: your SQL looks wrong, as count will need a group by clause.
Best
Ayan
On 11 May 2015 16:22, Tyler Mitchell tyler.mitch...@actian.com wrote:
I'm using Python to set up a dataframe, but for some reason it is not
being made
Typically you would use dot notation to access it, the same way you would access a
map.
On 12 May 2015 00:06, Ashish Kumar Singh ashish23...@gmail.com wrote:
Hi ,
I am trying to read Nested Avro data in Spark 1.3 using DataFrames.
I need help to retrieve the Inner element data in the Structure below.
It depends on how you want to run your application. You can always save the 100
batches as data files and run another app to read those files. In that case
you have separate contexts, and you will find both applications running
simultaneously in the cluster but on different JVMs. But if you do not want
Thanks, but is there a non-broadcast solution?
On 5 May 2015 01:34, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have implemented map-side join with broadcast variables and the code is
on mailing list (scala).
On Mon, May 4, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote:
Hi
Can
with a given predicate to
implement this ? (I would probably also need to provide a partitioner, and
some sorting predicate).
Left and right RDD are 1-10 millions lines long.
Any idea ?
Thanks
Mathieu
How did you end up with thousands of DFs? Are you using streaming? In that
case you can do foreachRDD, keep merging incoming RDDs into a single RDD, and
then save it through your own checkpoint mechanism.
If not, please share your use case.
On 11 May 2015 00:38, Peter Aberline
file. They have the same schema.
There is also the option of appending each DF to the parquet file, but
then I can't maintain them as separate DF when reading back in without
filtering.
I'll rethink maintaining each CSV file as a single DF.
Thanks,
Peter
On 10 May 2015 at 15:51, ayan guha
I am just wondering whether CREATE TABLE supports the syntax
CREATE TABLE db.tablename
instead of the two-step process of USE db and then CREATE TABLE tablename?
On 9 May 2015 08:17, Michael Armbrust mich...@databricks.com wrote:
Actually, I was talking about the support for inferring different but
Do as Evo suggested: rdd1 = rdd.filter(...), rdd2 = rdd.filter(...).
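A plain-Python sketch of the two-filter split, with invented hashtags and tweets (in Spark each list comprehension would be an `rdd.filter(...)` producing an independent derived RDD):

```python
# One pass per predicate: each filter yields its own derived dataset,
# which can then be routed to a different sink (e.g. a Spark SQL table).
tweets = ["#hashtag1 hello", "#hashtag2 world", "#hashtag1 again"]

rdd1 = [t for t in tweets if "#hashtag1" in t]   # goes to one sink
rdd2 = [t for t in tweets if "#hashtag2" in t]   # goes to another sink

print(len(rdd1), len(rdd2))  # 2 1
```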
On 9 May 2015 05:19, anshu shukla anshushuk...@gmail.com wrote:
Any update on the above mail?
And can anyone tell me the logic? I have to filter tweets and submit tweets
with a particular #hashtag1 to Spark SQL databases and tweets with
()
thx,
Antony.
Here is from documentation:
Spark SQL is designed to be compatible with the Hive Metastore, SerDes and
UDFs. Currently Spark SQL is based on Hive 0.12.0 and 0.13.1.
On Sun, May 17, 2015 at 1:48 AM, ayan guha guha.a...@gmail.com wrote:
Hi
Try with Hive 0.13. If I am not wrong, Hive 0.14
the performance.
Thanks.
Justin
On Fri, May 15, 2015 at 6:32 AM, ayan guha guha.a...@gmail.com wrote:
Can you kindly elaborate on this? It should be possible to write UDAFs along
similar lines to sum/min etc.
On Fri, May 15, 2015 at 5:49 AM, Justin Yip yipjus...@prediction.io
wrote:
Hello,
May I
:
*2553: 0,0,0,1,0,1,0,0*
46551: 0,1,0,0,0,0,0,0
266: 0,1,0,0,0,0,0,0
*225546: 0,0,0,0,0,2,0,0*
Anyone can help me getting that?
Thank you.
Have a nice day.
yasemin
--
hiç ender hiç
)
lines.count()
On Thu, May 14, 2015 at 4:17 AM, ayan guha guha.a...@gmail.com wrote:
Jo
Thanks for the reply, but _jsc does not have anything to pass Hadoop
configs. Can you illustrate your answer a bit more? TIA...
On Wed, May 13, 2015 at 12:08 AM, Ram Sriharsha sriharsha@gmail.com
wrote
With this information it is hard to predict. What's the performance you are
getting? What's your desired performance? Maybe you can post your code so
experts can suggest improvements?
On 14 May 2015 15:02, sachin Singh sachin.sha...@gmail.com wrote:
Hi Friends,
please can someone give the
...@gmail.com
wrote:
I understood that this port value is randomly selected.
Is there a way to enforce which spark port a Worker should use?
batches, I would need to handle update in case the hdfs
directory already exists.
Is this a common approach? Are there any other approaches that I can try?
Thank you!
Nisrina.
(jsc)
https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
through which you can access the hadoop configuration
On Tue, May 12, 2015 at 6:39 AM, ayan guha guha.a...@gmail.com wrote:
Hi
I found this method in the Scala API but not in the Python API (1.3.1).
Basically, I
My first thought would be to create 10 RDDs and run your word count on each
of them. I think the Spark scheduler is going to resolve the dependencies in
parallel and launch 10 jobs.
Best
Ayan
On 18 May 2015 23:41, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote:
Hi,
Consider I have a tab delimited text
Hi
So to be clear, do you want to run one operation in multiple threads within
a function, or do you want to run multiple jobs using multiple threads? I am
wondering why Python's threading module can't be used. Or have you already given
it a try?
On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in
Your stack trace says it can't convert a date to an integer. Are you sure about
the column positions?
On 13 May 2015 21:32, Ishwardeep Singh ishwardeep.si...@impetus.co.in
wrote:
Hi ,
I am using Spark SQL 1.3.1.
I have created a dataFrame using jdbc data source and am using
saveAsTable()
method but got
the seed (call
random.seed()) once on each worker?
--
*From:* ayan guha guha.a...@gmail.com
*Sent:* Tuesday, May 12, 2015 11:17 PM
*To:* Charles Hayden
*Cc:* user
*Subject:* Re: how to set random seed
Easiest way is to broadcast it.
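A minimal plain-Python sketch of the idea (the `base_seed` value and function name are invented for illustration; in PySpark you would broadcast the seed and use it inside `mapPartitionsWithIndex`-style code):

```python
import random

# Reproducible per-partition shuffling: broadcast one base seed and derive a
# deterministic per-partition seed from it, so successive runs produce the
# same shuffle.
base_seed = 42  # this is the value you would broadcast

def shuffle_partition(index, rows):
    rng = random.Random(base_seed + index)  # deterministic per partition
    rows = list(rows)
    rng.shuffle(rows)
    return rows

first = shuffle_partition(0, range(10))
second = shuffle_partition(0, range(10))
assert first == second  # same seed, same partition index -> same order
print(first[:3])
```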
On 13 May 2015 10:40, Charles
and create a logical plan. Even if I have
just one row, it's taking more than 1 hour just to get past the parsing.
Any idea how to optimize in this kind of scenario?
Regards,
Madhukara Phatak
http://datamantra.io/
Easiest way is to broadcast it.
On 13 May 2015 10:40, Charles Hayden charles.hay...@atigeo.com wrote:
In pySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs
will produce the same shuffle.
I am looking
, how?
the schema, I am specifying every field
as nullable, so I believe it should not throw this error. Can anyone help
me fix this error? Thank you.
Regards,
Anand.C
What does your spark-env file say? Are you setting the number of executors in
the Spark context?
On 20 May 2015 13:16, Shailesh Birari sbirar...@gmail.com wrote:
Hi,
I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB
of RAM.
I have around 600,000+ Json files on HDFS. Each file
And if I am not wrong, the Spark SQL API is intended to move closer to SQL
standards. I feel it's a clever decision on Spark's part to keep both APIs
operational. These short-term confusions are worth the long-term benefits.
On 20 May 2015 17:19, Sean Owen so...@cloudera.com wrote:
I don't think that's
are you using
Sent from my iPhone
On 19 May 2015, 18:29, ayan guha guha.a...@gmail.com wrote:
can you kindly share your code?
On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com
wrote:
Hi,
I am trying to run a Spark SQL aggregation on a file with 26k columns. The number
of rows is very small. I am
Thanks a bunch
On 21 May 2015 07:11, Davies Liu dav...@databricks.com wrote:
The docs had been updated.
You should convert the DataFrame to RDD by `df.rdd`
On Mon, Apr 20, 2015 at 5:23 AM, ayan guha guha.a...@gmail.com wrote:
Hi
Just upgraded to Spark 1.3.1.
I am getting an warning
Try with a larger number of partitions in parallelize.
On 4 Jun 2015 06:28, Justin Spargur jmspar...@gmail.com wrote:
Hi all,
I'm playing around with manipulating images via Python and want to
utilize Spark for scalability. That said, I'm just learning Spark and my
Python is a bit rusty
Another option is to merge the part-files after your app ends.
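For illustration, a plain-Python sketch of merging part-files after the job (the directory and file contents are invented; on HDFS, `hadoop fs -getmerge` does the equivalent):

```python
from pathlib import Path
import tempfile

# Concatenate the part-* files a job wrote into one output file, instead of
# forcing repartition(1) inside the job itself.
out_dir = Path(tempfile.mkdtemp())
(out_dir / "part-00000").write_text("line1\n")   # stand-in for job output
(out_dir / "part-00001").write_text("line2\n")

# Sort so parts are concatenated in order, then write the merged file.
merged = "".join(p.read_text() for p in sorted(out_dir.glob("part-*")))
(out_dir / "merged.txt").write_text(merged)
print(merged)
```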
On 5 Jun 2015 20:37, Akhil Das ak...@sigmoidanalytics.com wrote:
you can simply do rdd.repartition(1).saveAsTextFile(...), but it might not be
efficient if your output data is huge, since one task will be doing all the
writing.
Thanks
Best
is the better
solution?
Best Regards
Marcos
, etc.) in order to have the cluster up and running after boot-up,
although I'd like to understand if it will cause more issues than it
solves.
Thanks, Mike.
operations like
join, groupBy, agg, unionAll etc., which are all transformations on RDDs? Are
they lazily evaluated or immediately executed?