Hi,
Responding to your queries:
I am using Spark 2.2.1. I have tried with dynamic resource allocation both turned on and off, and have encountered the same behaviour.
The way the data is being read is that file paths (one per independent data set) are passed to a method, and then the method does the
Additionally, by modularization I meant that jobs that have really nothing to do with each other should live in separate Python programs.
> On 5. Jun 2018, at 04:50, Thakrar, Jayesh
> wrote:
>
> Disclaimer - I use Spark with Scala and not Python.
>
> But I am guessing that Jorn's reference to
Disclaimer - I use Spark with Scala and not Python.
But I am guessing that Jorn's reference to modularization is to ensure that you do the processing inside methods/functions and call those methods sequentially. I believe that as long as an RDD/dataset variable is in scope, its memory may not be released.
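A minimal sketch of that idea in PySpark (the dataset names, paths, and the placeholder transformation are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def process_dataset(input_path, output_path):
    # Everything created here goes out of scope when the function
    # returns, so no lingering references keep the DataFrames alive.
    df = spark.read.json(input_path)
    result = df.filter("actionType is not null")  # placeholder transformation
    result.write.mode("overwrite").parquet(output_path)

for name in ["ds1", "ds2", "ds3"]:  # hypothetical dataset names
    process_dataset("/data/%s/in" % name, "/data/%s/out" % name)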
The partitionBy clause is used to create Hive-style folders so that you can point a Hive partitioned table at the data.
What are you using partitionBy for? What is the use case?
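For illustration, a hedged sketch of what partitionBy produces on disk (paths are hypothetical): each distinct value of the partition column becomes a Hive-style folder that a partitioned Hive table can point at.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("/in")  # hypothetical input path

# Produces one folder per distinct bucket value, e.g.:
#   /out/bucket=B01/part-....json
#   /out/bucket=B02/part-....json
#   /out/bucket=B03/part-....json
df.write.partitionBy("bucket").json("/out")  # hypothetical output path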
On Mon 4 Jun, 2018, 4:59 PM purna pradeep, wrote:
> I'm reading the below JSON in Spark:
>
> {"bucket": "B01",
Can you tell us what version of Spark you are using and whether Dynamic Allocation is enabled?
Also, how are the files being read? Is it a single read of all files using a file-matching regex, or are you running different threads in the same pyspark job?
On Mon 4 Jun, 2018, 1:27 PM Shuporno
Purna,
This behavior is by design. If you provide partitionBy, Spark removes the partition columns from the data files (they are encoded in the directory structure instead).
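A hedged sketch of one common workaround, duplicating the partition column so that a copy survives inside the JSON records (the column and path names are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("/in")  # hypothetical input path

# The duplicated column goes into the directory names
# (bucket_dup=B01/...), while the original bucket column
# stays in the written records.
(df.withColumn("bucket_dup", F.col("bucket"))
   .write
   .partitionBy("bucket_dup")
   .json("/out"))  # hypothetical output path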
From: purna pradeep
Date: Monday, June 4, 2018 at 8:00 PM
To: "user@spark.apache.org"
Subject: spark partitionBy with partitioned column in json output
I'm reading the below JSON in Spark:
{"bucket": "B01", "actionType": "A1", "preaction": "NULL",
"postaction": "NULL"}
{"bucket": "B02", "actionType": "A2", "preaction": "NULL",
"postaction": "NULL"}
{"bucket": "B03", "actionType": "A3", "preaction": "NULL",
"postaction": "NULL"}
val
Thanks a lot for the insight.
Actually, I have the exact same transformations for all the datasets, hence only one Python program.
Now, do you suggest that I run a separate spark-submit for each dataset, given that I have the exact same transformations?
On Tue 5 Jun, 2018, 1:48 AM Jörn
Yes, if they are independent, with different transformations, then I would create a separate Python program for each. Especially with big data processing frameworks, one should avoid putting everything into one big monolithic application.
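A minimal sketch of that layout: one small, self-contained program per dataset, each submitted separately (the file name and paths are hypothetical):

# process_one_dataset.py -- submitted once per dataset, e.g.:
#   spark-submit process_one_dataset.py /data/ds1/in /data/ds1/out
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]
    spark = SparkSession.builder.appName("one-dataset-job").getOrCreate()
    df = spark.read.json(input_path)
    df.write.mode("overwrite").parquet(output_path)
    spark.stop()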
> On 4. Jun 2018, at 22:02, Shuporno Choudhury
> wrote:
>
> Hi,
>
>
All,
I would like to apply a Java transformation UDF on a DataFrame created from a table or flat files, and return a new DataFrame object. Any suggestions with respect to Spark internals?
Thanks.
Hi,
Thanks for the input.
I was trying to get the functionality working first, hence local mode. I will definitely be running on a cluster, but later.
Sorry for my naivety, but can you please elaborate on the modularity concept that you mentioned and how it will affect whatever I am already
Why don't you modularize your code and write, for each process, an independent Python program that is submitted via Spark?
I am not sure, though, that Spark local mode makes sense. If you don't have a cluster, then a normal Python program can be much better.
> On 4. Jun 2018, at 21:37, Shuporno Choudhury
>
Hi everyone,
I am trying to run a PySpark job on some data sets sequentially [basically: 1. read data into a dataframe, 2. perform some join/filter/aggregation, 3. write the modified data in Parquet format to a target location].
Now, while running this pyspark code across *multiple independent data sets
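A minimal sketch of those three steps (the paths, columns, and aggregation are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Read data into a dataframe
df = spark.read.json("/data/ds1/in")

# 2. Perform some join/filter/aggregation
agg = (df.filter(F.col("actionType").isNotNull())
         .groupBy("bucket")
         .agg(F.count("*").alias("n")))

# 3. Write the modified data in Parquet format to a target location
agg.write.mode("overwrite").parquet("/data/ds1/out")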
Hi there,
I am looking for an example of optimization through Catalyst that you can demonstrate via code. Typically, you load some data into a dataframe, you do something, you do the opposite operation, and, when you collect, it's super fast because nothing really happened to the data.
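One classic demonstration, sketched here as a do/undo pair that Catalyst's projection collapsing eliminates from the optimized plan (the column name is arbitrary):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)

# Add a column, then immediately drop it again.
roundtrip = df.withColumn("tmp", F.lit(1)).drop("tmp")

# The optimized logical plan collapses both projections away, so the
# query is equivalent to scanning the original range: nothing really
# happened to the data.
roundtrip.explain(True)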
I think there is also a misunderstanding of how repartition works. It keeps the existing number of partitions but hash-partitions according to userid. This means each partition is likely to contain several different user ids.
That would also explain your observed behavior. However, without having the full
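A small sketch that makes this visible, counting distinct user ids per partition on synthetic data:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumn("userid", F.col("id") % 10)

# Hash partitioning by userid: rows with the same userid land in the
# same partition, but one partition may still hold several userids.
(df.repartition("userid")
   .withColumn("pid", F.spark_partition_id())
   .groupBy("pid")
   .agg(F.countDistinct("userid").alias("distinct_userids"))
   .show())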
How do you load the data? How do you write it?
I fear that without the full source code it will be difficult to troubleshoot the issue.
Which Spark version?
The use case is not yet 100% clear to me. You want to flag the row with the oldest/newest date as true? I would just use top or something similar.
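A hedged sketch of one way to do this, using a window function rather than top (the key and date columns are hypothetical):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("k1", "2018-06-01"), ("k1", "2018-06-04"), ("k2", "2018-06-02")],
    ["key", "date"])

# Rank rows per key by date, newest first, and flag the top row.
w = Window.partitionBy("key").orderBy(F.col("date").desc())
flagged = df.withColumn("is_newest", F.row_number().over(w) == 1)
flagged.show()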
Hi All,
Is there a way to create a static dataframe inside mapGroups, given that mapGroups provides an Iterator of rows? I just want to take that iterator and populate a static dataframe so that I can run raw SQL queries on it.
Thanks!
Yes, the issue is with the array type only; I have confirmed that.
I exploded the array to a struct but am still getting the same error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct <> struct at the 21th column
Thanks Jayesh.
It worked for me.
~Swapnil
On Fri, 1 Jun 2018, 7:10 pm Lalwani, Jayesh,
wrote:
> This will not work the way you have implemented it. The code that you have
> here will be called only once before the streaming query is started. Once
> the streaming query starts, this code is not
Hello!
Thank you very much for your helpful answer and for the very good job done on spark-testing-base. I managed to perform unit testing with the spark-testing-base library as in the provided article, and also got inspired from
Have you tried to narrow down the problem so that we can be 100% sure that it lies in the array types? Just exclude them for the sake of testing.
If we know 100% that it is this array stuff, try to explode those columns into simple types (a sketch follows after this message).
Jorge Machado
> On 4 Jun 2018, at 11:09, Pranav
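A minimal PySpark sketch of the explode suggestion (the thread itself uses Java; the column names here are hypothetical stand-ins for the real schema):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the real data.
df = spark.createDataFrame(
    [(1, [(10, 2)])],
    "booking_id int, booking_rooms array<struct<room_id:int,room_category_id:int>>")

# explode() turns each array element into its own row, so a later
# union compares scalar/struct columns instead of array columns.
flat = (df.withColumn("room", F.explode("booking_rooms"))
          .select("booking_id", "room.room_id", "room.room_category_id"))
flat.show()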
I am ordering the columns before doing the union, so I think that should not be the issue:
String[] columns_original_order = baseDs.columns();
String[] columns = baseDs.columns();
Arrays.sort(columns);
baseDs = baseDs.selectExpr(columns);
Try the same union with a dataframe without the array types. It could be something strange there, like ordering or so.
Jorge Machado
> On 4 Jun 2018, at 10:17, Pranav Agrawal wrote:
>
> schema is exactly the same, not sure why it is failing though.
>
> root
> |-- booking_id: integer
The schema is exactly the same; not sure why it is failing, though.
root
|-- booking_id: integer (nullable = true)
|-- booking_rooms_room_category_id: integer (nullable = true)
|-- booking_rooms_room_id: integer (nullable = true)
|-- booking_source: integer (nullable = true)
|-- booking_status: