…written to disk, you may see duplicates when there are failures. However, if you read
the output location with Spark you should get exactly-once results (unless
there is a bug), since Spark will know how to use the commit log to see which
data files are committed and which are not.
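For illustration, a minimal sketch of that read path (assuming a structured
streaming file sink and a hypothetical output path): reading the directory
through Spark's DataFrame reader consults the sink's _spark_metadata commit
log, whereas listing the files by hand would also pick up uncommitted ones.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark uses the file sink's _spark_metadata log to list only committed
# files; duplicates left behind by failed tasks are ignored.
df = spark.read.parquet("s3://my-bucket/stream-output/")
print(df.count())
```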
Best,
Jerry
On Mon, Sep 18, 20
Craig,
Thanks! Please let us know the result!
Best,
Jerry
On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh
wrote:
>
> Hi Craig,
>
> Can you please clarify what this bug is and provide sample code causing
> this issue?
>
> HTH
>
> Mich Talebzadeh,
> Disting
across a test suite. In general I try to keep my fixtures to one
concrete task only, so that if I find myself repeating a pattern I just
factor it out into another fixture.
On Tue, Feb 9, 2021 at 11:14 AM Mich Talebzadeh
wrote:
> Thanks Jerry for your comments.
>
> The easiest option and
…within other
fixtures (which are also just functions), since the result of the fixture
call is just some Python object.
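For illustration, a minimal sketch of composing fixtures this way (assuming
pytest and pyspark are installed; the names are hypothetical):
```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One concrete task: provide a SparkSession for the whole test run.
    session = SparkSession.builder.master("local[2]").getOrCreate()
    yield session
    session.stop()

@pytest.fixture
def employees_df(spark):
    # Builds on the spark fixture; the result is just a DataFrame object.
    return spark.createDataFrame(
        [(1, "alice"), (2, "bob")], ["employee_id", "employee_name"]
    )

def test_count(employees_df):
    assert employees_df.count() == 2
```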
Hope this helps,
Jerry
On Tue, Feb 9, 2021 at 10:18 AM Mich Talebzadeh
wrote:
> I was a bit confused with the use of fixtures in Pytest with the
> dataframes passed as a
…employee_name
> now the HTTP GET call has to be made for each employee_id and the DataFrame is
> dynamic for each Spark job run.
>
> Does it make sense?
>
> Thanks
>
>
> On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov
> wrote:
>
>> Hi Chetan,
>>
>> You
all the
workers to have access to whatever you're getting from the API, that's the
way to do it.
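The concrete advice is truncated in this excerpt; as a sketch, assuming the
suggestion was to fetch once on the driver and broadcast the result (the
endpoint and field names are hypothetical):
```python
import json
import urllib.request

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fetch once on the driver, then broadcast so every worker can read the
# result without re-calling the API.
with urllib.request.urlopen("https://api.example.com/employees") as resp:
    lookup = {r["employee_id"]: r["employee_name"] for r in json.load(resp)}

lookup_bc = spark.sparkContext.broadcast(lookup)
names = spark.sparkContext.parallelize([1, 2, 3]).map(
    lambda emp_id: lookup_bc.value.get(emp_id)
).collect()
```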
Jerry
On Thu, May 14, 2020 at 5:03 PM Chetan Khatri
wrote:
> Hi Spark Users,
>
> How can I invoke a REST API call from Spark code which is not only
> running on Spark Driver but
This seems like a suboptimal situation for a join. How can Spark know in
advance that all the fields are present and the tables have the same number
of rows? I suppose you could just sort the two frames by id and concatenate
them, but I'm not sure what join optimization is available here.
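As a sketch of the usual alternative (assuming both frames share an id column;
the names are hypothetical), an equi-join lets Spark choose its own strategy
instead of relying on matching row order:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "left_val"])
df2 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "right_val"])

# An equi-join on the shared key avoids assuming equal row counts
# and identical ordering.
joined = df1.join(df2, on="id", how="inner")
```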
On Fri,
our tolerance for error you could also use
>> percentile_approx().
>>
>> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov
>> wrote:
>>
>>> Do you mean that you are trying to compute the percent rank of some
>>> data? You can use the SparkSQL percent_rank
Do you mean that you are trying to compute the percent rank of some data?
You can use the SparkSQL percent_rank function for that, but I don't think
that's going to give you any improvement over calling the percentRank
function on the data frame. Are you currently using a user-defined function
for
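For reference, a minimal sketch of both suggestions (column names are
hypothetical; percentile_approx is called through a SQL expression, which also
works on Spark versions predating a dedicated Python wrapper):
```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "value"])

# Exact percent rank over an ordered window.
ranked = df.withColumn("pct", F.percent_rank().over(Window.orderBy("value")))

# Approximate percentiles, trading accuracy for speed.
quantiles = df.select(F.expr("percentile_approx(value, array(0.5, 0.95))"))
```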
properly.
Vadim's suggestions did not make a difference for me (still hitting this
error several times a day) but I'll try with disabling broadcast and see if
that does anything.
thanks,
Jerry
On Fri, Sep 20, 2019 at 10:00 AM Julien Laurenceau <
julien.laurenc...@pepitedata.com> wrote:
> Hi,
ize but... who knows). I will try it
with your suggestions and see if it solves the problem.
thanks,
Jerry
On Tue, Sep 17, 2019 at 4:34 PM Vadim Semenov wrote:
> Pre-register your classes:
>
> ```
> import com.esotericsoftware.kryo.Kryo
> import org.apache.spark.serializer.KryoRegistrator
isc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scal
e ?
Best,
Jerry
Maybe I'm not understanding something about this use case, but why is
precomputation not an option? Is it because the matrices themselves change?
Because if the matrices are constant, then I think precomputation would
work for you even if the users request random correlations. You can just
store
…under which
you'd like to register that table. You can then use the table in SQL
statements. As far as I know, you cannot directly refer to any external
data store without reading it in first.
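For illustration, a minimal sketch of that register-then-query flow (the path
and table name are hypothetical; registerTempTable was the Spark 1.x spelling,
later renamed createOrReplaceTempView):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the external data in first, then register it under a name.
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()
```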
Jerry
On Thu, Jul 11, 2019 at 1:27 PM infa elance wrote:
> Sorry, I guess I hit the send bu
he extraClasspath config settings on
the driver and executor, but that has not fixed the problem. I'm out of
ideas for how to tackle this and would love to hear if anyone has any
suggestions or strategies for fixing this.
thanks,
Jerry
--
http://www.google.com/profiles/grapesmoker
Hi Koert,
Thank you for your help! GOT IT!
Best Regards,
Jerry
On Wed, Feb 1, 2017 at 6:24 PM, Koert Kuipers <ko...@tresata.com> wrote:
> you can still use it as Dataset[Set[X]]. all transformations should work
> correctly.
>
> however dataset.schema will show binary type
as the input type, does some maths on the
product_id, and outputs a set of product_ids.
Best Regards,
Jerry
Hi Koert,
Thanks for the tips. I tried to do that but the column's type is now
Binary. Do I get the Set[X] back in the Dataset?
Best Regards,
Jerry
On Tue, Jan 31, 2017 at 8:04 PM, Koert Kuipers <ko...@tresata.com> wrote:
> set is currently not supported. you can use kry
…class which is the output type
for the aggregation.
Is there a workaround for this?
Best Regards,
Jerry
    numExplicits += 1
    ls.add(srcFactor, (c1 + 1.0) / c1, c1)
  }
} else {
  ls.add(srcFactor, rating)
  numExplicits += 1
}
{code}
Regards,
Jerry
On Mon, Dec 5, 2016 at 3:27 PM, Sean Owen <so...@cloudera.com> wrote:
>
…pair. (x^T y)^2 + regularization.
Do I misunderstand the paper?
Best Regards,
Jerry
On Mon, Dec 5, 2016 at 2:43 PM, Sean Owen <so...@cloudera.com> wrote:
> What are you referring to in what paper? implicit input would never
> materialize 0s for missing values.
>
> On Tue, Dec
.
Best Regards,
Jerry
sql("select firstName as first_name, middleName as middle_name,
lastName as last_name from jsonTable")
But there is an error:
org.apache.spark.sql.AnalysisException: cannot resolve 'middleName' given
input columns firstName, lastName;
Can anybody share any wisdom or suggestions?
Thanks!
Jerry
files like,
val row = sqlContext.sql("SELECT firstname, middlename, lastname,
address.state, address.city FROM jsontable")
The compiler will tell me the error on line 1: no "middlename".
How do I handle this case in Spark SQL?
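A common workaround (a sketch; the column names follow the query above, the
path is hypothetical) is to inspect the inferred schema and substitute a null
literal for any missing column:
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("people.json")

wanted = ["firstname", "middlename", "lastname"]
# Select the column when it exists, otherwise a null placeholder.
cols = [
    F.col(c) if c in df.columns else F.lit(None).alias(c)
    for c in wanted
]
result = df.select(*cols)
```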
Many thanks in advance!
Jerry
Hi David,
Thank you for your response.
Before inserting to Cassandra, I had checked that the data was already missing
in HDFS (my second step is to load data from HDFS and then insert it into
Cassandra).
Can you send me the link about this bug in 0.8.2?
Thank you!
Jerry
On Thu, May 5, 2016 at 12:38
and confirmed the
same number in the Broker. But when I checked either HDFS or Cassandra, the
number is just 363. The data is not always lost, just sometimes... That's
weird and annoying to me.
Can anybody give me some reasons?
Thanks!
Jerry
Hi spark users and developers,
Has anyone tried to pass an Array[Double] as an input to a UDF? I tried for
many hours, reading the Spark SQL code, but I still couldn't figure out a
way to do this.
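The question concerns the Scala API; for illustration, a PySpark sketch of the
same idea, a UDF taking an array-of-doubles column (names hypothetical):
```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1.0, 2.0, 3.0],)], ["values"])

# UDF taking an array<double> column and returning its sum.
arr_sum = F.udf(lambda xs: float(sum(xs)), DoubleType())
df.select(arr_sum("values").alias("total")).show()
```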
Best Regards,
Jerry
answer
though.
As I said, this is just the tip of the iceberg. I have seen worse than
this. For example, you might think renaming fields will work, but in some
cases it still returns wrong results.
Best Regards,
Jerry
On Tue, Mar 29, 2016 at 7:38 AM, Jerry Lam <chiling...@gmail.
Hi Divya,
This is not a self-join. d1 and d2 contain totally different rows; they are
derived from the same table. The transformations that are applied to
generate d1 and d2 should be able to disambiguate the labels in the
question.
Best Regards,
Jerry
On Tue, Mar 29, 2016 at 2:43 AM, Divya
. There are
other bugs I found these three days that are associated with this type of
join. In one case, if I don't drop the duplicate column BEFORE the join,
Spark prefers the columns from the d2 dataframe. I will see if I can
replicate it in a small program like the above.
Best Regards,
Jerry
On Mon, Mar
").drop(d2("label")).select(d1("label"))
The above code will throw an exception saying the column label is not
found. Is there a reason for throwing an exception when d1("label") has
not been dropped?
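A common way to sidestep the ambiguity (a sketch; d1, d2 and label are from
this thread, the id column is assumed) is to alias both frames and qualify
every column reference:
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
d1 = spark.createDataFrame([(1, "a")], ["id", "label"])
d2 = spark.createDataFrame([(1, "b")], ["id", "label"])

# Alias both sides and qualify the ambiguous column explicitly.
joined = (
    d1.alias("d1")
    .join(d2.alias("d2"), F.col("d1.id") == F.col("d2.id"))
    .select(F.col("d1.label"))
)
```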
Best Regards,
Jerry
that do that, please share
your findings!
Thank you,
Jerry
.
The JSON messages are coming from a Kafka consumer, at over 1,500 messages
per second. So the message processing (parsing and writing to Cassandra)
also needs to keep up at the same rate (1,500/second).
Thanks in advance.
Jerry
I would appreciate any help and advice you can give me.
I've misunderstood.
Best Regards,
Jerry
Sent from my iPhone
> On 19 Feb, 2016, at 10:20 am, Sebastian Piu <sebastian@gmail.com> wrote:
>
> I don't have the code with me now, and I ended up moving everything to RDDs
> in the end, using map operations to do some loo
Rado,
Yes, you are correct. A lot of messages are created at almost the same
time (even measured in milliseconds). I changed to use "UUID.randomUUID()",
with which all messages can be inserted in the Cassandra table without
colliding on timestamps.
Thank you very much!
Jerry Wong
On Wed, Feb 17, 2016
time)
But only about 100 messages can be inserted into Cassandra in each round of
testing.
Can anybody advise why the other messages (about 900) can't
be consumed?
How do I configure and tune the parameters in order to improve the
throughput of the consumers?
Thank you very much fo
Not sure if I understand your problem well, but why don't you create the file
locally and then upload it to HDFS?
Sent from my iPhone
> On 12 Feb, 2016, at 9:09 am, "seb.arzt" wrote:
>
> I have an Iterator of several million elements, which unfortunately won't fit
> into the
what I fine-tune most: making sure the task/core has
enough memory to execute to completion. Sometimes you really don't know
how much data you keep in memory until you profile your application
(calculating some statistics helps).
Best Regards,
Jerry
On Wed, Feb 3, 2016 at 4:58 PM, Nirav Pate
I think the Spark DataFrame supports more than just SQL; it is more like a
pandas DataFrame. (I rarely use the SQL feature.)
There are a lot of novelties in DataFrame, so I think it is quite optimized
for many tasks. The in-memory data structure is very memory efficient. I
just changed a very slow RDD
Hi Michael,
Is there a section in the Spark documentation that demonstrates how to
serialize arbitrary objects in a DataFrame? The last time I did this was
using a User Defined Type (copied from VectorUDT).
Best Regards,
Jerry
On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <mich...@databricks.com>
to the same output text file?
Thank you!
Jerry
y why you think it is the
case? Relative to what other ways?
Best Regards,
Jerry
On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
> I don't understand why one thinks an RDD of case objects doesn't have
> types (schema)? If Spark can convert an RDD to a DataFrame
Hi Mao,
Can you try --jars to include those jars?
Best Regards,
Jerry
Sent from my iPhone
> On 26 Jan, 2016, at 7:02 pm, Mao Geng <m...@sumologic.com> wrote:
>
> Hi there,
>
> I am trying to run Spark on Mesos using a Docker image as executor, as
> mentioned
>
Is cacheTable similar to asTempTable before?
Sent from my iPhone
> On 19 Jan, 2016, at 4:18 am, George Sigletos wrote:
>
> Thanks Kevin for your reply.
>
> I was suspecting the same thing as well, although it still does not make much
> sense to me why would you need
Hi spark users and developers,
what do you do if you want the from_unixtime function in spark sql to
return the timezone you want instead of the system timezone?
Best Regards,
Jerry
if this is the only way out of the box.
Thanks!
Jerry
On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:
> Look at
> to_utc_timestamp
>
> from_utc_timestamp
> On Jan 18, 2016 9:39 AM, "Jerry Lam" <chiling...@gmail.com> wrot
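For illustration, a sketch of the suggested workaround (the epoch column is
hypothetical; from_unixtime renders in the system timezone, which is assumed
here to be America/Toronto):
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1453132800,)], ["epoch"])

df.select(
    F.from_unixtime("epoch").alias("system_tz"),
    # Reinterpret the system-timezone rendering as UTC.
    F.to_utc_timestamp(F.from_unixtime("epoch"), "America/Toronto").alias("utc"),
).show(truncate=False)
```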
a partitioned table because it takes a very
long time (hours on S3) to execute the
sqlcontext.read.parquet("partitioned_table").
Best Regards,
Jerry
Sent from my iPhone
> On 15 Jan, 2016, at 3:59 pm, Michael Armbrust <mich...@databricks.com> wrote:
>
> See here for s
Can you save it to parquet with the vector in one field?
Sent from my iPhone
> On 15 Jan, 2016, at 7:33 pm, Andy Davidson
> wrote:
>
> Are you using 1.6.0 or an older version?
>
> I think I remember something in 1.5.1 saying save was not implemented in
>
of memory and a very
long time to read the table back.
Best Regards,
Jerry
On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <arkadiusz.b...@gmail.com>
wrote:
> Hi
>
> What is the proper configuration for saving parquet partition with
> large number of repeated keys?
>
> On bello
ate
_temporary files and then it moved the files under the _temporary to the
output directory.
Is this behavior expected? Or is it a bug?
I'm using Spark 1.5.2.
Best Regards,
Jerry
Hi Michael,
Thanks for the hint! So if I turn off speculation, consecutive appends like
above will not produce temporary files, right?
Which class is responsible for disabling the use of DirectOutputCommitter?
Thank you,
Jerry
On Tue, Jan 12, 2016 at 4:12 PM, Michael Armbrust <m
Hi Kostiantyn,
Yes. If security is a concern, then this approach cannot satisfy it: the keys
are visible in the properties files. If the goal is to hide them, you might be
able to go a bit further with this approach. Have you looked at the Spark
security page?
Best Regards,
Jerry
Sent from my iPhone
is right. I'm using my phone,
so I cannot easily verify it.
Then you can specify a different user using a different spark.conf via
--properties-file when you spark-submit.
HTH,
Jerry
Sent from my iPhone
> On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev
> <kudryavtsev.konstan...@gmail.c
Hi Kostiantyn,
Can you define those properties in hdfs-site.xml and make sure it is visible in
the class path when you spark-submit? It looks like a conf sourcing issue to
me.
Cheers,
Sent from my iPhone
> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev
>
Hi Kostiantyn,
I want to confirm that it works first by using hdfs-site.xml. If yes, you could
define different spark-{user-x}.conf files and source them during spark-submit.
Let us know if hdfs-site.xml works first. It should.
Best Regards,
Jerry
Sent from my iPhone
> On 30 Dec, 2015, at 2:31
Best Regards,
Jerry
> On Dec 15, 2015, at 5:18 PM, Jakob Odersky <joder...@gmail.com> wrote:
>
> Hi Veljko,
> I would assume keeping the number of executors per machine to a minimum is
> best for performance (as long as you consider memory requirements as well).
> Each e
to do that without a manual process.
Best Regards,
Jerry
On Wed, Dec 2, 2015 at 1:02 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:
> Do you think it's a security issue if EMR started in VPC with a subnet
> having Auto-assign Public IP: Yes
>
> you can remove all Inbound ru
Hi Ted,
That looks exactly like what happens. It has been 5 hrs now. The code was
built for 1.4. Thank you very much!
Best Regards,
Jerry
Sent from my iPhone
> On 14 Nov, 2015, at 11:21 pm, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Which release are you using ?
> If older th
A spark.sql.hive.enabled=false configuration would be lovely too. :)
An additional bonus is that it requires less memory if we don't use
HiveContext on the driver side (~100-200MB, from a rough observation).
Thanks and have a nice weekend!
Jerry
> On Nov 6, 2015, at 5:53 PM, Ted Yu <yuzhih...@gma
. /home/jerry directory). It will give me an exception
like below.
Since I don’t use HiveContext, I don’t see the need to maintain a database.
What is interesting is that pyspark shell is able to start more than 1 session
at the same time. I wonder what pyspark has done better than spark-shell
)
at org.apache.derby.jdbc.Driver20.connect(Unknown Source)
at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
Best Regards,
Jerry
> On Nov 6, 2015, at 12:12 PM, Ted Yu <yuzhih...@gmail.com&
ply$mcZ$sp(SparkILoopExt.scala:127)
at
org.apache.spark.repl.SparkILoopExt$$anonfun$process$1.apply(SparkILoopExt.scala:113)
at
org.apache.spark.repl.SparkILoopExt$$anonfun$process$1.apply(SparkILoopExt.scala:113)
Best Regards,
Jerry
onfig of skipping the above call.
>
> FYI
>
> On Fri, Nov 6, 2015 at 8:53 AM, Jerry Lam <chiling...@gmail.com
> <mailto:chiling...@gmail.com>> wrote:
> Hi spark users and developers,
>
> Is it possible to disable HiveContext from being instantiated when usin
Does Qubole use Yarn or Mesos for resource management?
Sent from my iPhone
> On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan
> wrote:
>
> Qubole
We "used" Spark on Mesos to build an interactive data analysis platform
because the interactive session could be long and might not use Spark for
the entire session. It is very wasteful of resources if we used the
coarse-grained mode, because it keeps resources for the entire session.
Therefore,
r. the max-date is likely
> to be faster though.
>
> On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Koert,
>>
>> You should be able to see if it requires scanning the whole data by
>> "explain" the query. The physica
Hi Koert,
You should be able to see if it requires scanning the whole dataset by
"explain"-ing the query. The physical plan should say something about it. I
wonder if you are trying the distinct-sort-by-limit approach or the
max-date approach?
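As a sketch of that check (assuming the date-partitioned layout quoted below;
the path is hypothetical):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("data")  # directory containing date=... subfolders

# If partition pruning applies, the physical plan shows the date predicate
# as a partition filter instead of a full scan of every partition.
df.filter(df.date == "2015-01-01").explain()
```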
Best Regards,
Jerry
On Sun, Nov 1, 2015
of the physical plan, you can navigate the actual
execution in the web UI to see how much data is actually read to satisfy
this request. I hope it only requires a few bytes for a few dates.
Best Regards,
Jerry
On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam <chiling...@gmail.com> wrote:
> I agreed the
s actually works or
not. :)
Best Regards,
Jerry
On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <ko...@tresata.com> wrote:
> hello all,
> i am trying to get familiar with spark sql partitioning support.
>
> my data is partitioned by date, so like this:
> data/date=2015-01-01
>
I used Spark 1.3.1 to populate the event logs to Cassandra. But there
is an exception whose cause I could not figure out. Can anybody help
me?
Exception in thread "main" java.lang.IllegalArgumentException: Positive
number of slices required
at
Hi Bryan,
Did you read the email I sent few days ago. There are more issues with
partitionBy down the road:
https://www.mail-archive.com/user@spark.apache.org/msg39512.html
Best Regards,
Jerry
> On Oct 28, 2015, a
. It takes a while to
initialize the partitioned table, and it requires a lot of memory from the
driver. I would not use it if the number of partitions goes over a few hundred.
Hope this helps,
Jerry
Sent from my iPhone
> On 28 Oct, 2015, at 6:33 pm, Bryan <bryan.jeff...@gmail.com> wrote:
>
.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any idea why it can read the schema from the parquet file but not
process the file? It feels like the Hadoop configuration is not sent to
the executors for some reason...
Thanks,
Jerry
…load the parquet file, but I cannot perform a count on the parquet file
because of the AmazonClientException. It means that the credentials are used
during the loading of the parquet but not when we are processing the
parquet file. How can this happen?
Best Regards,
Jerry
On Tue, Oct 27, 2015 at 2:05 PM,
…set("key",
"value") does not propagate through all SQL jobs within the same
SparkContext? I haven't tried with Spark Core so I cannot tell.
Is there a workaround, given that it seems to be broken? I need to do this
programmatically after the SparkContext is instantiated, not before...
Best Regards,
J
of partition is over 100.
Best Regards,
Jerry
Sent from my iPhone
> On 26 Oct, 2015, at 2:50 am, Fengdong Yu <fengdo...@everstring.com> wrote:
>
> How many partitions you generated?
> if Millions generated, then there is a huge memory consumed.
>
>
>
>
>
>
mory which is
a bit odd in my opinion. Any help will be greatly appreciated.
Best Regards,
Jerry
On Sun, Oct 25, 2015 at 9:25 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
> Hi Jerry,
>
> Do you have speculation enabled? A write which produces one million files
> / output pa
)
org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:31)
org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395)
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:267)
On Sun, Oct 25, 2015 at 10:25 PM, Jerry Lam <chiling...@gmail.com> wrote:
> Hi Josh,
>
>
parameters to make it more memory efficient?
Best Regards,
Jerry
On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam <chiling...@gmail.com> wrote:
> Hi guys,
>
> After waiting for a day, it actually causes OOM on the spark driver. I
> configure the driver to have 6GB. Note that I didn't c
million
files. Not sure why it OOMs the driver after the job is marked _SUCCESS in
the output folder.
Best Regards,
Jerry
On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam <chiling...@gmail.com> wrote:
> Hi Spark users and developers,
>
> Has anyone encountered any issue when a Spark SQL job
?
Best Regards,
Jerry
,LongType,true),
StructField(type,StringType,true))
As you can see, the schema does not match: the nullable field is set to true
for timestamp upon reading the dataframe back. Is there a way to preserve
the schema so that what we write will be what we read back?
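One workaround (a sketch; the field names follow the snippet above) is to pass
the original schema explicitly when reading back, instead of relying on what
the file reports:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("timestamp", LongType(), nullable=False),
    StructField("type", StringType(), nullable=True),
])

# Supplying the schema on read overrides the inferred nullability.
df = spark.read.schema(schema).parquet("output_path")
```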
Best Regards,
Jerry
Can you try setting SPARK_USER at the driver? It is used to impersonate users
at the executor. So if you have a user set up for launching Spark jobs on the
executor machines, simply set SPARK_USER to that user name. There is
another configuration that prevents jobs being launched with
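The excerpt is cut off here; as a sketch of the SPARK_USER suggestion (the
user name "etl" is hypothetical; the variable must be set before the
SparkContext's JVM starts):
```python
import os

# Hypothetical: impersonate the 'etl' user on the executors.
os.environ["SPARK_USER"] = "etl"  # set before creating the SparkSession

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
```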
I'm interested in it, but I doubt there will be R-tree indexing support in the
near future, as Spark is not a database. You might have better luck looking at
databases with spatial indexing support out of the box.
Cheers
Sent from my iPad
On 2015-10-18, at 17:16, Mustafa Elbehery
I just read the article by ogirardot, but I don't agree.
It is like saying the pandas DataFrame is the sole data structure for analyzing
data in Python. Can a pandas DataFrame replace a NumPy array? The answer is
simply no, from an efficiency perspective, for some computations.
Unless there is a computer
This is the ticket SPARK-10951
<https://issues.apache.org/jira/browse/SPARK-10951>
Cheers~
On Tue, Oct 6, 2015 at 11:33 AM, Jerry Lam <chiling...@gmail.com> wrote:
> Hi Burak,
>
> Thank you for the tip.
> Unfortunately it does not work. It throws:
>
> java.net.
support an S3 repo. I will file a JIRA ticket for this.
Best Regards,
Jerry
On Sat, Oct 3, 2015 at 12:50 PM, Burak Yavuz <brk...@gmail.com> wrote:
> Hi Jerry,
>
> The --packages feature doesn't support private repositories right now.
> However, in the case of s3, maybe it might work
Philip, the guy is trying to help you. Calling him silly is a bit too far. He
might assume your problem is IO-bound, which might not be the case. If you need
only 4 cores per job no matter what, there is little advantage to using Spark,
in my opinion, because you can easily do this with just a worker
?
Thank you!
Jerry
Hi Michael and Ted,
Thank you for the reference. Is it true that once I implement a custom data
source, it can be used in all Spark-supported languages? That is, Scala,
Java, Python and R. :)
I want to take advantage of the interoperability that is already built into
Spark.
Thanks!
Jerry
On Tue
Hi spark users and developers,
I'm trying to learn how to implement a custom data source for Spark SQL. Is
there documentation that I can use as a reference? I'm not sure exactly
what needs to be extended/implemented. A general workflow would be greatly
helpful!
Best Regards,
Jerry
dataframe?
Best Regards,
Jerry
On Fri, Sep 25, 2015 at 7:53 PM, Michael Armbrust <mich...@databricks.com>
wrote:
> The SQL parser without HiveContext is really simple, which is why I
> generally recommend users use HiveContext. However, you can do it with
> datafra
'' expected but identifier view found
with the query looking like:
"select items from purhcases lateral view explode(purchase_items) tbl as
items"
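A workaround that avoids the Hive parser entirely (a sketch; column names
follow the query above) is the DataFrame explode function:
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
purchases = spark.createDataFrame([(["a", "b"],)], ["purchase_items"])

# Equivalent of LATERAL VIEW explode(...) without the Hive SQL parser.
items = purchases.select(F.explode("purchase_items").alias("items"))
```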
Best Regards,
Jerry
Do you have specific reasons to use Ceph? I used Ceph before; I'm not too
in love with it, especially when I was using the Ceph Object Gateway S3 API.
There are some incompatibilities with the AWS S3 API. You really, really need
to try it before making the commitment. Did you manage to install it?
On
is that
the architecture and the performance will be similar to S3+Spark at best
(with 10GE instances) if you guys do the network stuff seriously.
HTH,
Jerry
On Tue, Sep 22, 2015 at 9:59 PM, fightf...@163.com <fightf...@163.com>
wrote:
> Hi Jerry
>
> Yeah, we managed to run and u
Hi Amit,
Have you looked at Amazon EMR? Most people using EMR use S3 for persistence
(both as input and output of Spark jobs).
Best Regards,
Jerry
Sent from my iPhone
> On 21 Sep, 2015, at 9:24 pm, Amit Ramesh <a...@yelp.com> wrote:
>
>
> A lot of places in the documenta
…much faster. Not to mention that if I do:
df.rdd.take(1) // runs much faster
Is this expected? Why is head/first/take so slow for a dataframe? Is it a bug
in the optimizer, or did I do something wrong?
Best Regards,
Jerry
a bit. I created a
ticket for this (SPARK-10731
<https://issues.apache.org/jira/browse/SPARK-10731>).
Best Regards,
Jerry
On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yh...@databricks.com> wrote:
> btw, does 1.4 has the same problem?
>
> On Mon, Sep 21, 2015 at 1
I just noticed you found that 1.4 has the same issue. I added that to
the ticket as well.
On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chiling...@gmail.com> wrote:
> Hi Yin,
>
> You are right! I just tried the Scala version with the above lines, and it
> works as expected.
> I'm n