As Keith said, it depends on what you want to do with your data.
From a pipelining perspective the general flow (YMMV) is:
Load dataset(s) -> Transform and/or Join -> Aggregate -> Write dataset
Each step in the pipeline does something distinct with the data.
The end step is usually
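For illustration, a minimal Scala sketch of that flow (paths, columns, and
names here are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder.appName("PipelineSketch").getOrCreate()

    val orders = spark.read.parquet("/data/orders")  // Load dataset(s)
    val users  = spark.read.parquet("/data/users")

    val joined = orders.join(users, Seq("userId"))   // Transform and/or Join

    val totals = joined
      .groupBy("country")                            // Aggregate
      .agg(sum("amount").as("totalAmount"))

    totals.write.mode("overwrite").parquet("/data/totals")  // Write dataset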
Got it, that's good to know. Thanks!
On Wed, Apr 5, 2017 at 12:07 AM, Kazuaki Ishizaki wrote:
> Hi,
> The page in the URL explains the old style of physical plan output.
> The current style adds "*" as a prefix of each operation that
> whole-stage codegen can be applied to.
Hi, I am using Spark 1.6 in YARN cluster mode. When my application runs, I am
unable to see GC time metrics in the Spark UI (Application
UI -> Stages -> Tasks). I am attaching a screenshot here.
Is this a bug in the Spark UI, or is this expected?
Hi,
The page in the URL explains the old style of physical plan output.
The current style adds "*" as a prefix of each operation that
whole-stage codegen can be applied to.
So, in your test case, whole-stage codegen has already been enabled!
FYI. I think that it is a good topic for
I am using Spark Streaming with the twitter4j API to pull tweets.
I am able to pull tweets for some keywords but not for others. Even if I
explicitly tweet with those keywords, the API does not pull them; for some
keywords it works smoothly. Has anyone encountered this issue before? Please
suggest a solution.
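For reference, keyword filters are passed to twitter4j like this with the
spark-streaming-twitter connector (a hedged sketch; the app name, batch
interval, and keywords are hypothetical, and the twitter4j OAuth credentials
are assumed to be set as system properties):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    val conf = new SparkConf().setAppName("TweetPull").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // None -> credentials are read from the twitter4j.oauth.* system properties
    val tweets = TwitterUtils.createStream(ssc, None, Seq("keyword1", "keyword2"))
    tweets.map(_.getText).print()  // print a sample of matching tweet texts

    ssc.start()
    ssc.awaitTermination()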
Are you referring to the memory usage of stateful operations like
aggregations, or the new mapGroupsWithState?
The current implementation of the internal state store (which maintains the
stateful aggregates) keeps all the data in the memory of the
executor. It does use an HDFS-compatible
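For illustration, a minimal sketch of a stateful streaming aggregation (the
source, host, and paths are hypothetical); the running counts below are the
kind of per-key state the store keeps in executor memory, checkpointed to an
HDFS-compatible directory:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("StateSketch").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // stateful aggregation: per-word running counts live in the state store
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    val query = counts.writeStream
      .outputMode("complete")
      .option("checkpointLocation", "/tmp/state-sketch")  // HDFS-compatible path
      .format("console")
      .start()

    query.awaitTermination()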
Why do we ever run out of memory in Spark Structured Streaming, especially
when memory can always spill to disk? Until the disk is full, we shouldn't
run out of memory, should we? Sure, thrashing will happen more frequently and
degrade performance, but why do we ever run out of memory, even in case of
It is fixed in https://issues.apache.org/jira/browse/SPARK-13330
On Wed, Apr 5, 2017 at 12:03 AM, Holden Karau wrote:
> Which version of Spark is this (or is it a dev build)? We've recently made
> some improvements with PYTHONHASHSEED propagation.
>
> On Tue, Apr 4, 2017 at 7:49 AM Eike
I wrote my own expression with eval and doGenCode, but doGenCode never gets
called in tests.
Also, as a test, I ran this in a unit test:
spark.range(10).select('id as 'asId).where('id === 4).explain
according to
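For reference, on a Spark 2.1-era build the plan for that query is printed
with the "*" prefixes, roughly like this (operator names and expression IDs
vary by version):

    == Physical Plan ==
    *Project [id#0L AS asId#3L]
    +- *Filter (id#0L = 4)
       +- *Range (0, 10, step=1, splits=Some(8))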
As Paul said, it really depends on what you want to do with your data;
perhaps writing it to a file would be a better option, but again, it depends
on what you want to do with the data you collect.
Regards,
Keith.
http://keith-chapman.com
On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern
If you can find the name of the struct field from the schema, you can just
do:
df.select($"arrayField.a")
Selecting a field from an array returns an array with that field selected
from each element.
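A small self-contained sketch (the schema here is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("ArrayFieldSketch").getOrCreate()
    import spark.implicits._

    case class Item(a: Int, b: String)
    val df = Seq(
      (1, Seq(Item(1, "x"), Item(2, "y"))),
      (2, Seq(Item(3, "z")))
    ).toDF("id", "arrayField")

    // one array per row, holding field "a" of each struct element
    df.select($"arrayField.a").show()
    // roughly:
    // +------+
    // |     a|
    // +------+
    // |[1, 2]|
    // |   [3]|
    // +------+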
On Mon, Apr 3, 2017 at 8:18 PM, Koert Kuipers wrote:
> I have a DataFrame
Which version of Spark is this (or is it a dev build)? We've recently made
some improvements with PYTHONHASHSEED propagation.
On Tue, Apr 4, 2017 at 7:49 AM, Eike von Seggern wrote:
2017-04-01 21:54 GMT+02:00 Paul Tremblay :
When I try to
So that means I have to pass that bash variable to the EMR clusters when I
spin them up, not afterwards. I'll give that a go.
Thanks!
Henry
On Tue, Apr 4, 2017 at 7:49 AM, Eike von Seggern wrote:
> 2017-04-01 21:54 GMT+02:00 Paul Tremblay :
2017-04-01 21:54 GMT+02:00 Paul Tremblay :
> When I try to do a groupByKey() in my Spark environment, I get the
> error described here:
>
> http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh
>
Hi,
depending on what you're trying to achieve, `RDD.toLocalIterator()` might
help you.
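A quick sketch: unlike collect(), toLocalIterator only materializes one
partition's worth of rows on the driver at a time ("query" is a placeholder):

    val df = spark.sql("query")

    // rows are fetched partition by partition, not all at once
    df.rdd.toLocalIterator.foreach(println)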
Best
Eike
2017-03-29 21:00 GMT+02:00 szep.laszlo.it :
> Hi,
>
> after I created a dataset
>
> Dataset df = sqlContext.sql("query");
>
> I need to have the result values, and I call a
Hi Rishikesh,
Sounds like the Postgres driver isn't being loaded onto the classpath. To
debug it, try submitting the application with --jars (note that options such
as --jars must come before the application jar),
e.g.
spark-submit --jars /home/ubuntu/downloads/postgres/postgresql-9.4-1200-jdbc41.jar {application.jar}
If that does not work, then there is a problem
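Once the driver jar is on the classpath, a minimal read might look like the
sketch below (URL, table, and credentials are hypothetical):

    val jdbcDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.mytable")
      .option("user", "dbuser")
      .option("password", "dbpass")
      .option("driver", "org.postgresql.Driver")  // force-load the Postgres driver
      .load()

    jdbcDf.show(5)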