Re: Alternatives for dataframe collectAsList()

2017-04-04 Thread lucas.g...@gmail.com
As Keith said, it depends on what you want to do with your data. From a pipelining perspective the general flow (YMMV) is: Load dataset(s) -> Transform and/or Join -> Aggregate -> Write dataset. Each step in the pipeline does something distinct with the data. The end step is usually
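The load -> transform -> aggregate -> write flow above can be sketched in plain Python. This is a conceptual mock of the pipeline shape, not Spark API; the records and function names are illustrative:

```python
from collections import defaultdict

def load_dataset():
    # Stand-in for reading a dataset from storage.
    return [
        {"user": "a", "amount": 10},
        {"user": "b", "amount": 5},
        {"user": "a", "amount": 7},
    ]

def transform(rows):
    # Example transform: keep only sufficiently large amounts.
    return [r for r in rows if r["amount"] >= 6]

def aggregate(rows):
    # Sum amounts per user.
    totals = defaultdict(int)
    for r in rows:
        totals[r["user"]] += r["amount"]
    return dict(totals)

def write_dataset(result):
    # A real pipeline would write to a file or table; here we just return it.
    return result

result = write_dataset(aggregate(transform(load_dataset())))
print(result)  # {'a': 17}
```

Each stage takes the previous stage's output, which is what makes the steps composable and individually testable.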

Re: how do i force unit test to do whole stage codegen

2017-04-04 Thread Koert Kuipers
got it. that's good to know. thanks! On Wed, Apr 5, 2017 at 12:07 AM, Kazuaki Ishizaki wrote: > Hi, > The page in the URL explains the old style of physical plan output. > The current style adds "*" as a prefix of each operation that the > whole-stage codegen can be applied

spark stages UI page has 'gc time' column Empty

2017-04-04 Thread satishl
Hi, I am using spark 1.6 in YARN cluster mode. When my application runs, I am unable to see gc time metrics in the Spark UI (Application UI->Stages->Tasks). I am attaching the screenshot here. Is this a bug in Spark UI or is this expected?

Re: how do i force unit test to do whole stage codegen

2017-04-04 Thread Kazuaki Ishizaki
Hi, The page in the URL explains the old style of physical plan output. The current style adds "*" as a prefix of each operation that whole-stage codegen can be applied to. So, in your test case, whole-stage codegen has already been enabled! FYI, I think that it is a good topic for
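The "*" prefix convention described above can be checked mechanically in a unit test. A small helper in plain Python; the plan text below is a made-up example modeled on the shape of Spark's explain() output, not captured from a real run:

```python
def uses_wholestage_codegen(plan: str) -> bool:
    """Return True if any operator line in a physical-plan string is
    prefixed with '*', the marker for whole-stage-codegen operators."""
    return any(line.lstrip("+- ").startswith("*") for line in plan.splitlines())

# Illustrative plan text in the style explain() prints.
plan = """== Physical Plan ==
*Project [id#0L AS asId#2L]
+- *Filter (id#0L = 4)
   +- *Range (0, 10, step=1, splits=Some(4))"""

print(uses_wholestage_codegen(plan))  # True
```

A test could capture the plan string from the query under test and assert this predicate instead of eyeballing console output.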

With Twitter4j API, why am I not able to pull tweets with certain keywords?

2017-04-04 Thread Gaurav1809
I am using Spark Streaming with the twitter4j API to pull tweets. I am able to pull tweets for some keywords but not for others. Even if I explicitly tweet with those keywords, the API does not pull them. For some it is smooth. Has anyone encountered this issue before? Please suggest a solution.

Re: Why do we ever run out of memory in Spark Structured Streaming?

2017-04-04 Thread Tathagata Das
Are you referring to the memory usage of stateful operations like aggregations, or the new mapGroupsWithState? The current implementation of the internal state store (that maintains the stateful aggregates) is such that it keeps all the data in memory of the executor. It does use HDFS-compatible
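To make the "spill to disk" idea from the question concrete: a store that spills keeps a bounded set of entries in memory and pushes overflow to disk. The toy below is purely illustrative of that technique; as noted above, Spark's current state store does *not* do this, it keeps state in executor memory:

```python
import os
import shelve
import tempfile

class SpillableStore:
    """Toy key/value store: at most `max_in_memory` entries live in a
    dict, the rest spill to a disk-backed shelve file. Illustrative
    only -- not how Spark's state store is implemented."""

    def __init__(self, max_in_memory=2):
        self.max_in_memory = max_in_memory
        self.mem = {}
        self.disk = shelve.open(os.path.join(tempfile.mkdtemp(), "spill"))

    def put(self, key, value):
        if key in self.mem or len(self.mem) < self.max_in_memory:
            self.mem[key] = value
        else:
            # Memory budget exhausted: spill this entry to disk.
            self.disk[key] = value

    def get(self, key):
        if key in self.mem:
            return self.mem[key]
        return self.disk.get(key)

store = SpillableStore(max_in_memory=2)
for i in range(4):
    store.put("k%d" % i, i)
print(len(store.mem), len(store.disk))  # 2 entries in memory, 2 spilled
```

With such a scheme the store degrades to disk speed instead of failing with OOM, which is exactly the trade-off the question is asking about.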

Why do we ever run out of memory in Spark Structured Streaming?

2017-04-04 Thread kant kodali
Why do we ever run out of memory in Spark Structured Streaming, especially when memory can always spill to disk? Until the disk is full we shouldn't be out of memory, should we? Sure, thrashing will happen more frequently and degrade performance, but why do we ever run out of memory even in case of

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Jeff Zhang
It is fixed in https://issues.apache.org/jira/browse/SPARK-13330 Holden Karau wrote on Wed, Apr 5, 2017 at 12:03 AM: > Which version of Spark is this (or is it a dev build)? We've recently made > some improvements with PYTHONHASHSEED propagation. > > On Tue, Apr 4, 2017 at 7:49 AM Eike

how do i force unit test to do whole stage codegen

2017-04-04 Thread Koert Kuipers
i wrote my own expression with eval and doGenCode, but doGenCode never gets called in tests. also as a test i ran this in a unit test: spark.range(10).select('id as 'asId).where('id === 4).explain according to

Re: Alternatives for dataframe collectAsList()

2017-04-04 Thread Keith Chapman
As Paul said it really depends on what you want to do with your data, perhaps writing it to a file would be a better option, but again it depends on what you want to do with the data you collect. Regards, Keith. http://keith-chapman.com On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern

Re: map transform on array in spark sql

2017-04-04 Thread Michael Armbrust
If you can find the name of the struct field from the schema you can just do: df.select($"arrayField.a") Selecting a field from an array returns an array with that field selected from each element. On Mon, Apr 3, 2017 at 8:18 PM, Koert Kuipers wrote: > i have a DataFrame
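The semantics of `df.select($"arrayField.a")` described above — projecting field `a` out of every struct in the array — can be mimicked on plain Python data. The row below is illustrative, not Spark API:

```python
# A row whose "arrayField" column is an array of structs {a, b}.
row = {"arrayField": [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]}

# Selecting arrayField.a yields an array containing just the `a`
# value of each element, in order.
projected = [elem["a"] for elem in row["arrayField"]]
print(projected)  # [1, 2]
```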

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Holden Karau
Which version of Spark is this (or is it a dev build)? We've recently made some improvements with PYTHONHASHSEED propagation. On Tue, Apr 4, 2017 at 7:49 AM Eike von Seggern wrote: 2017-04-01 21:54 GMT+02:00 Paul Tremblay : When I try to

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Paul Tremblay
So that means I have to pass that bash variable to the EMR clusters when I spin them up, not afterwards. I'll give that a go. Thanks! Henry On Tue, Apr 4, 2017 at 7:49 AM, Eike von Seggern wrote: > 2017-04-01 21:54 GMT+02:00 Paul Tremblay :

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Eike von Seggern
2017-04-01 21:54 GMT+02:00 Paul Tremblay : > When I try to do a groupByKey() in my spark environment, I get the > error described here: > > http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh >
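The root cause behind that error is Python's per-process hash randomization: without a fixed PYTHONHASHSEED, two interpreter processes generally disagree on string hashes, which breaks hash-partitioned operations like groupByKey across workers. A quick pure-Python demonstration, no Spark needed:

```python
import os
import subprocess
import sys

def hash_in_subprocess(seed):
    """Compute hash('spark') in a fresh interpreter started with the
    given PYTHONHASHSEED, mimicking a separate worker process."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('spark'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With the same seed, every process computes the same hash -- which is
# what consistent key routing across executors requires.
a = hash_in_subprocess(0)
b = hash_in_subprocess(0)
print(a == b)  # True
```

This is why the seed has to be set in the environment before the worker interpreters start (e.g. at EMR cluster spin-up), not afterwards.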

Re: Alternatives for dataframe collectAsList()

2017-04-04 Thread Eike von Seggern
Hi, depending on what you're trying to achieve `RDD.toLocalIterator()` might help you. Best Eike 2017-03-29 21:00 GMT+02:00 szep.laszlo.it : > Hi, > > after I created a dataset > > Dataset df = sqlContext.sql("query"); > > I need to have a result values and I call a
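The point of `RDD.toLocalIterator()` is that only one partition's worth of rows is resident on the driver at a time, whereas collect-style calls materialize the whole result at once. A plain-Python sketch of the two access patterns, with mock partitions standing in for an RDD (not Spark API):

```python
def partitions():
    # Stand-in for an RDD's partitions, produced lazily.
    for start in range(0, 9, 3):
        yield list(range(start, start + 3))

def collect(parts):
    # collect()-style: materialize every row at once on the "driver".
    return [row for part in parts for row in part]

def to_local_iterator(parts):
    # toLocalIterator()-style: stream rows one partition at a time.
    for part in parts:
        yield from part

all_rows = collect(partitions())
streamed = list(to_local_iterator(partitions()))
print(all_rows == streamed)  # True: same data, different peak memory
```

Both paths yield the same rows; the iterator version just bounds driver memory by the largest partition instead of the whole dataset.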

Re: Executor unable to pick postgres driver in Spark standalone cluster

2017-04-04 Thread Sam Elamin
Hi Rishikesh, Sounds like the postgres driver isn't being loaded on the path. To try and debug it, submit the application with the --jars flag, e.g. spark-submit {application.jar} --jars /home/ubuntu/downloads/postgres/postgresql-9.4-1200-jdbc41.jar If that does not work then there is a problem