Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-19 Thread Varun Shah
No Data Transfer During Creation: --> Data transfer occurs only when an > action is triggered. > Distributed Processing: --> DataFrames are distributed for parallel > execution, not stored entirely on the driver node. > Lazy Evaluation Optimization: --> Delaying data tra

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh wrote: > > "I may need something like that for synthetic data for testing. Any way to > do that ?" > > Have a look at this. > > https://github.com/joke2k/faker > No I was not actually referring to data that can be faked. I want data to actually

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
eyan Chakravarty wrote: > > On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh > wrote: > >> >> No Data Transfer During Creation: --> Data transfer occurs only when an >> action is triggered. >> Distributed Processing: --> DataFrames are distributed

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh wrote: > > No Data Transfer During Creation: --> Data transfer occurs only when an > action is triggered. > Distributed Processing: --> DataFrames are distributed for parallel > execution, not stored entirely on the driver no

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
to disk. look at stages tab in UI (4040) *In summary:* No Data Transfer During Creation: --> Data transfer occurs only when an action is triggered. Distributed Processing: --> DataFrames are distributed for parallel execution, not stored entirely on the driver node. Lazy Evaluation Optimi

pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Sreyan Chakravarty
I am trying to understand Spark architecture. For DataFrames that are created from Python objects, i.e. that are *created in memory*, where are they stored? Take the following example: from pyspark.sql import Row; import datetime; courses = [ { 'course_id': 1, 'course_title
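A minimal sketch of the scenario above (the record contents are assumed; only the shape matters). Until an action runs, the Python list lives in driver memory and the DataFrame is just a query plan; the action is what distributes the data to executors:

    from pyspark.sql import SparkSession, Row
    import datetime

    spark = SparkSession.builder.getOrCreate()
    courses = [{'course_id': 1, 'course_title': 'Spark', 'created': datetime.datetime.now()}]
    df = spark.createDataFrame([Row(**c) for c in courses])

    df.explain()   # at this point only a plan exists; no job has run yet
    df.count()     # the action triggers a job and ships the parallelized data to executors (visible in the UI on port 4040)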

Re: EXT: Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-28 Thread Vibhor Gupta
m Verma Sent: Monday, December 26, 2022 8:08 PM To: Russell Jurney Cc: Gurunandan ; user@spark.apache.org Subject: EXT: Re: Check if shuffle is caused for repartitioned pyspark dataframes I tried sorting the repartitioned dataframes on the pa

Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-26 Thread Shivam Verma
I tried sorting the repartitioned dataframes on the partition key before saving them as parquet files, however, when I read those repartitioned-sorted dataframes and join them on the partition key, the spark plan still shows `Exchange hashpartitioning` step, which I want to avoid

Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-23 Thread Russell Jurney
, but I can see it in both > the experiments: > 1. Using repartitioned dataframes > 2. Using initial dataframes > > Does that mean that the repartitioned dataframes are not actually > "co-partitioned"? > If that's the case, I have two more questions: > > 1.

Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-23 Thread Shivam Verma
Hi Gurunandan, Thanks for the reply! I do see the exchange operator in the SQL tab, but I can see it in both the experiments: 1. Using repartitioned dataframes 2. Using initial dataframes Does that mean that the repartitioned dataframes are not actually "co-partitioned"? If that's t

Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-13 Thread Shivam Verma
Hello folks, I have a use case where I save two pyspark dataframes as parquet files and then use them later to join with each other or with other tables and perform multiple aggregations. Since I know the column being used in the downstream joins and groupby, I was hoping I could use co
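One commonly used alternative, not visible in the truncated replies above, is bucketing: plain repartition-then-write does not carry co-partitioning into a later read, but writing both sides bucketed on the join key lets the planner skip the Exchange. A minimal sketch, assuming Hive-style tables are acceptable (bucket metadata is only kept for tables, not bare parquet paths) and that the column "key", the table names and the bucket count are placeholders:

    # write both dataframes bucketed and sorted on the join key
    df1.write.bucketBy(64, "key").sortBy("key").mode("overwrite").saveAsTable("t1")
    df2.write.bucketBy(64, "key").sortBy("key").mode("overwrite").saveAsTable("t2")

    # a later join on the bucketed column can avoid the Exchange hashpartitioning step
    joined = spark.table("t1").join(spark.table("t2"), "key")
    joined.explain()   # verify that no Exchange appears on either side of the join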

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Hollis
this is the reason you got the OOM and analysis exception. My suggestion is to checkpoint the dataframe once you have joined 200 dataframes, so you can truncate the lineage and the optimizer only has to analyze those 200 dataframes. This will reduce the pressure on the Spark engine. | Hollis | Replied
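A minimal sketch of the checkpoint-every-N-joins idea suggested above; the checkpoint directory, the list of dataframes, the join key and the batch size of 200 are placeholders:

    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   # required once before checkpoint()

    def join_all(dfs, key, batch=200):
        result = dfs[0]
        for i, df in enumerate(dfs[1:], start=1):
            result = result.join(df, key)
            if i % batch == 0:
                result = result.checkpoint()   # materializes the data and truncates the lineage/plan
        return result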

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
dataframes may apply and RDD are used, but for UDF's I prefer SQL as well, but that may be a personal idiosyncrasy. The Oreilly book on data algorithms using SPARK, pyspark uses dataframes and RDD API's :) Regards, Gourav Sengupta On Fri, Dec 24, 2021 at 6:11 PM Sean Owen wrote

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Sean Owen
will be faster. Sometimes you have to go outside SQL where necessary, like in UDFs or complex aggregation logic. Then you can't use SQL. On Fri, Dec 24, 2021 at 12:05 PM Gourav Sengupta wrote: > Hi, > > yeah I think that in practice you will always find that dataframes can > give issu

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
Hi, yeah I think that in practice you will always find that dataframes can give issues regarding a lot of things, and then you can argue. In the SPARK conference, I think last year, it was shown that more than 92% or 95% use the SPARK SQL API, if I am not mistaken. I think that you can do

OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Andrew Davidson
Hi Sean and Gourav, Thanks for the suggestions. I thought that both the SQL and DataFrame APIs are wrappers around the same framework, i.e. Catalyst? I tend to mix and match my code; sometimes I find it easier to write using SQL, sometimes DataFrames. What is considered best practice? Here

Re: Merge two dataframes

2021-05-19 Thread ayan guha
> On Wed, May 12, 2021 at 11:07 AM Sean Owen wrote: >> Yeah I don't think that's going to work - you aren't guaranteed to get 1, 2, 3, etc. I think row_number() might be what you need to generate a join ID. >> RDD has a .zip

Re: Merge two dataframes

2021-05-19 Thread Mich Talebzadeh
't think that's going to work - you aren't guaranteed to get > 1, 2, 3, etc. I think row_number() might be what you need to generate a > join ID. > > > > RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. > You could .zip two RDDs you get from DataFrames and

Re: Merge two dataframes

2021-05-19 Thread Mich Talebzadeh
>> On Tue, 18 May 2021 at 16:39, kushagra deep wrote: >>> The use case is to calculate PSI/CSI values .

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
18 May 2021 at 16:39, kushagra deep > wrote: > >> The use case is to calculate PSI/CSI values . And yes the union is one to >> one row as you showed. >> >> On Tue, May 18, 2021, 20:39 Mich Talebzadeh >> wrote: >> >>> >>> Hi Kushagra, >>>

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
>> A bit late on this but what is the business use case for this merge? >> You have two data frames each with one column and you want to UNION them in a certain way but the correlation is not known. In other words this UNION is as is? >> amount_6m | amou

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
shagra deep > wrote: > >> Hi All, >> >> I have two dataframes >> >> df1 >> >> amount_6m >> 100 >> 200 >> 300 >> 400 >> 500 >> >> And a second data df2 below >> >> amount_9m >>

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
500 200 600 HTH On Wed, 12 May 2021 at 13:51, kushagra deep wrote: > Hi All, > > I have two dataframes > > df1 > > amount_6m > 100 > 200 > 300 > 400 > 500 > > And a second data df2 below > > amount_9m > 500 > 600

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
al implementation, we had > a series of r one per rule. For N rules, we created N dataframes that had the > rows that satisfied the rules. The we unioned the N data frames. Horrible > performance that didn't scale with N. We reimplemented to add N Boolean > columns; one p

Re: Merge two dataframes

2021-05-17 Thread Lalwani, Jayesh
a series of r one per rule. For N rules, we created N dataframes that had the rows that satisfied the rules. The we unioned the N data frames. Horrible performance that didn't scale with N. We reimplemented to add N Boolean columns; one per rule; that indicated if the rule was satisfied. We just kept

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
In our case, these UDFs are quite expensive and worked on in an iterative manner, so being able to cache the two "sides" of the graphs independently will speed up the development cycle. Otherwise, if you modify foo() here, then you have to recompute bar and baz, even though they're unchanged.

Re: Merge two dataframes

2021-05-17 Thread Sean Owen
Why join here - just add two columns to the DataFrame directly? On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote: > Anyone have ideas about the below Q? > > It seems to me that given that "diamond" DAG, that spark could see > that the rows haven't been shuffled/filtered, it could do some type

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
you aren't guaranteed to get 1, > > 2, 3, etc. I think row_number() might be what you need to generate a join > > ID. > > > > RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You > > could .zip two RDDs you get from DataFrames and manually co

Re: Merge two dataframes

2021-05-12 Thread Andrew Melo
DataFrame does not. You > could .zip two RDDs you get from DataFrames and manually convert the Rows > back to a single Row and back to DataFrame. > > > On Wed, May 12, 2021 at 10:47 AM kushagra deep > wrote: >> >> Thanks Raghvendra >> >> Will

Re: Merge two dataframes

2021-05-12 Thread Sean Owen
Yeah I don't think that's going to work - you aren't guaranteed to get 1, 2, 3, etc. I think row_number() might be what you need to generate a join ID. RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You could .zip two RDDs you get from DataFrames and manually convert
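A minimal sketch of the row_number() idea above for pairing the two single-column dataframes from this thread; the ordering column is synthetic, and the window has no partitionBy, so it funnels the data through a single task (assumed acceptable for modest sizes):

    from pyspark.sql import functions as F, Window

    w = Window.orderBy(F.monotonically_increasing_id())
    a = df1.withColumn("rid", F.row_number().over(w))     # df1 has amount_6m
    b = df2.withColumn("rid", F.row_number().over(w))     # df2 has amount_9m
    merged = a.join(b, "rid").drop("rid")                 # row-by-row pairing via the generated join ID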

Re: Merge two dataframes

2021-05-12 Thread kushagra deep
Thanks Raghvendra. Will the ids for corresponding columns always be the same? Since monotonically_increasing_id() returns a number based on the partitionId and the row number within the partition, will it be the same for corresponding columns? Also, is it guaranteed that the two dataframes will be divided

Re: Merge two dataframes

2021-05-12 Thread Raghavendra Ganesh
"), "inner").drop("id").show()
+---------+---------+
|amount_6m|amount_9m|
+---------+---------+
|      100|      500|
|      200|      600|
|      300|      700|
|      400|      800|
|      500|      900|
+---------+---------+
-- Raghavendra On Wed, May 12, 2021 at 6:20 PM kushagra deep w

Merge two dataframes

2021-05-12 Thread kushagra deep
Hi All, I have two dataframes df1 amount_6m 100 200 300 400 500 And a second data df2 below amount_9m 500 600 700 800 900 The number of rows is same in both dataframes. Can I merge the two dataframes to achieve below df df3 amount_6m | amount_9m 100

Find difference between two dataframes in spark structured streaming

2020-12-16 Thread act_coder
I am creating a spark structured streaming job, where I need to find the difference between two dataframes. Dataframe 1 : [1, item1, value1] [2, item2, value2] [3, item3, value3] [4, item4, value4] [5, item5, value5] Dataframe 2: [4, item4, value4] [5, item5, value5] New Dataframe
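For the batch case, a minimal sketch of the two usual approaches (column names are assumed from the [id, item, value] shape in the example); for two streaming DataFrames the allowed join/except operations depend on the Spark version, so one side may have to be treated as static:

    # rows of Dataframe 1 whose id has no match in Dataframe 2
    missing = df1.join(df2, on=["id"], how="left_anti")

    # or compare entire rows (Spark 2.4+)
    missing_rows = df1.exceptAll(df2)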

Re: Refreshing Data in Spark Memory (DataFrames)

2020-11-13 Thread Lalwani, Jayesh
will be incurring IO overhead on every microbatch. From: Arti Pande Date: Friday, November 13, 2020 at 2:19 PM To: "Lalwani, Jayesh" Cc: "user@spark.apache.org" Subject: RE: [EXTERNAL] Refreshing Data in Spark Memory (DataFrames)

Re: Refreshing Data in Spark Memory (DataFrames)

2020-11-13 Thread Arti Pande
old > computation, your results don’t change. > There might be scenarios where you want to correct old reference data. In > this case you update your reference table, and rerun your computation. > > > > Now, if you are talking about streaming applications, then it’s a > d

Re: Refreshing Data in Spark Memory (DataFrames)

2020-11-13 Thread Lalwani, Jayesh
reference data. Spark reloads the dataframes from batch sources at the beginning of every microbatch. As long as you are reading the data from from a non-streaming source, it will get refreshed in every microbatch. Alternatively, you can send updates to reference data through a stream, and then merge
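A minimal sketch of the stream-static pattern described above; the broker address, topic, reference path and the join column "k" are all assumptions:

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "host:9092")
              .option("subscribe", "events").load()
              .selectExpr("CAST(key AS STRING) AS k", "CAST(value AS STRING) AS v"))

    # non-streaming side of the join; per the explanation above it is re-read at the
    # start of each micro-batch, so updates to the reference data become visible
    reference = spark.read.parquet("/data/reference")

    enriched = events.join(reference, "k", "left")
    query = enriched.writeStream.format("console").outputMode("append").start()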

Refreshing Data in Spark Memory (DataFrames)

2020-11-13 Thread Arti Pande
Hi In the financial systems world, if some data is being updated too frequently, and that data is to be used as reference data by a Spark job that runs for 6/7 hours, most likely Spark job may read that data at the beginning and keep it in memory as DataFrame and will keep running for remaining

Bloom Filter to filter huge dataframes with PySpark

2020-09-23 Thread Breno Arosa
Hello, I need to filter one huge table using other huge tables. I could not avoid a sort operation using `WHERE IN` or `INNER JOIN`. Can this be avoided? As I'm OK with false positives, maybe a Bloom filter is an alternative. I saw that Scala has a builtin dataframe function

Re: Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversion in streaming fashion

2020-06-11 Thread Tanveer Ahmad - EWI
Hi Jorge, Thank you. This union function is a better alternative for my work. Regards, Tanveer Ahmad From: Jorge Machado Sent: Monday, May 25, 2020 3:56:04 PM To: Tanveer Ahmad - EWI Cc: Spark Group Subject: Re: Arrow RecordBatches/Pandas Dataframes to (Arrow

Re: Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversion in streaming fashion

2020-05-25 Thread Jorge Machado
Hey, from what I know you can try to Union them df.union(df2) Not sure if this is what you need > On 25. May 2020, at 13:53, Tanveer Ahmad - EWI wrote: > > Hi all, > > I need some help regarding Arrow RecordBatches/Pandas Dataframes to (Arrow > enabled) Spark Dataframe c

Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversion in streaming fashion

2020-05-25 Thread Tanveer Ahmad - EWI
Hi all, I need some help regarding Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversions. Here the example explains very well how to convert a single Pandas Dataframe to Spark Dataframe [1]. But in my case, some external applications are generating Arrow
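A minimal sketch of the conversion-plus-union route suggested later in this thread; record_batches stands in for the incoming pyarrow.RecordBatch objects, and the exact name of the Arrow flag varies between Spark versions:

    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")   # Arrow-accelerated pandas conversion

    # record_batches: assumed list/stream of pyarrow.RecordBatch objects arriving from the external apps
    pdfs = (batch.to_pandas() for batch in record_batches)
    sdf = reduce(DataFrame.union, (spark.createDataFrame(p) for p in pdfs))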

Re: How to split a dataframe into two dataframes based on count

2020-05-18 Thread Vipul Rajan
fle. Spark automatically caches when a data shuffle happens. Let me know if you get it to work. Regards On Mon, May 18, 2020 at 10:27 PM Mohit Durgapal wrote: > Dear All, > > I would like to know how, in spark 2.0, can I split a dataframe into two > dataframes when I know the exact cou

How to split a dataframe into two dataframes based on count

2020-05-18 Thread Mohit Durgapal
Dear All, I would like to know how, in spark 2.0, can I split a dataframe into two dataframes when I know the exact counts the two dataframes should have. I tried using limit but got quite weird results. Also, I am looking for exact counts in child dfs, not the approximate % based split
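A minimal sketch of one way to get exact-count splits: assign consecutive positions with zipWithIndex and filter on them; the helper name and the temporary _idx column are illustrative:

    from pyspark.sql import Row
    from pyspark.sql import functions as F

    def split_exact(df, n_first):
        # zipWithIndex gives consecutive 0-based positions, so the split sizes are exact
        indexed = (df.rdd.zipWithIndex()
                     .map(lambda rec: Row(**rec[0].asDict(), _idx=rec[1]))
                     .toDF())
        first = indexed.filter(F.col("_idx") < n_first).drop("_idx")
        second = indexed.filter(F.col("_idx") >= n_first).drop("_idx")
        return first, second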

[Spark MLlib]: Multiple input dataframes and non-linear ML pipeline

2020-04-09 Thread Qingsheng Ren
Hi all, I'm using ML Pipeline to construct a flow of transformation. I'm wondering if it is possible to set multiple dataframes as the input of a transformer? For example I need to join two dataframes together in a transformer, then feed into the estimator for training. If not, is there any plan

Spark 2.2.1 Dataframes multiple joins bug?

2020-03-23 Thread Dipl.-Inf. Rico Bergmann
Hi all! Is it possible that Spark creates under certain circumstances duplicate rows when doing multiple joins? What I did: buse.count res0: Long = 20554365 buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").count res1: Long = 20554365

Re: Questions about count() performance with dataframes and parquet files

2020-02-18 Thread Nicolas PARIS
> either materialize the Dataframe on HDFS (e.g. parquet or checkpoint) I wonder if avro is a better candidate for this because it's row oriented it should be faster to write/read for such a task. Never heard about checkpoint. Enrico Minack writes: > It is not about very large or small, it is

Re: Questions about count() performance with dataframes and parquet files

2020-02-17 Thread Enrico Minack
It is not about very large or small, it is about how large your cluster is w.r.t. your data. Caching is only useful if you have the respective memory available across your executors. Otherwise you could either materialize the Dataframe on HDFS (e.g. parquet or checkpoint) or indeed have to do
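A minimal sketch of the materialize-then-count alternative mentioned above, for when the dataframe does not fit in executor memory; the path is illustrative:

    # write the expensive intermediate result once, then continue from the files
    df_actions.write.mode("overwrite").parquet("/tmp/df_actions")
    df_actions = spark.read.parquet("/tmp/df_actions")

    n = df_actions.count()   # counts scan the parquet files instead of recomputing the upstream plan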

Re: Questions about count() performance with dataframes and parquet files

2020-02-17 Thread Nicolas PARIS
> .dropDuplicates() \ .cache() | > Since df_actions is cached, you can count inserts and updates quickly > with only that one join in df_actions: Hi Enrico. I am wondering if this is ok for very large tables ? Is caching faster than recomputing both insert/update ? Thanks Enrico Minack

Re: Questions about count() performance with dataframes and parquet files

2020-02-13 Thread Ashley Hoff
Hi, Thank you both for your suggestions! These have been eyeopeners for me. Just to clarify, I need the counts for logging and auditing purposes otherwise I would exclude the step. I should have also mentioned that while I am processing around 30 GB of raw data, the individual outputs are

Re: Questions about count() performance with dataframes and parquet files

2020-02-13 Thread Enrico Minack
Ashley, I want to suggest a few optimizations. The problem might go away but at least performance should improve. The freeze problems could have many reasons, the Spark UI SQL pages and stages detail pages would be useful. You can send them privately, if you wish. 1. the repartition(1)

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hi ashley, Apologies reading this on my phone as work l laptop doesn't let me access personal email. Are you actually doing anything with the counts (printing to log, writing to table?) If you're not doing anything with them get rid of them and the caches entirely. If you do want to do

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
Thanks David, I did experiment with the .cache() keyword and have to admit I didn't see any marked improvement on the sample that I was running, so yes I am a bit apprehensive including it (not even sure why I actually left it in). When you say "do the count as the final step", are you referring

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hi Ashley, I'm not an expert but think this is because spark does lazy execution and doesn't actually perform any actions until you do some kind of write, count or other operation on the dataframe. If you remove the count steps it will work out a more efficient execution plan reducing the number

Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
Hi, I am currently working on an app using PySpark to produce an insert and update daily delta capture, being outputted as Parquet. This is running on a 8 core 32 GB Linux server in standalone mode (set to 6 worker cores of 2GB memory each) running Spark 2.4.3. This is being achieved by reading

Re: Re: union two pyspark dataframes from different SparkSessions

2020-01-29 Thread Zong-han, Xie
Dear Yeikel, I checked my code and it uses getOrCreate to create a SparkSession. Therefore, I should be retrieving the same SparkSession instance every time I call that method. Thanks for the reminder. Best regards -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
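A small illustration of why the union worked: builder.getOrCreate() hands back the already-running session, so both "sessions" are the same object (a sketch, not taken from the thread's code):

    from pyspark.sql import SparkSession

    s1 = SparkSession.builder.getOrCreate()
    s2 = SparkSession.builder.getOrCreate()
    assert s1 is s2          # the existing active session is returned, not a new one

    df = s1.createDataFrame([(1,)], ["a"]).union(s2.createDataFrame([(2,)], ["a"]))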

Re: union two pyspark dataframes from different SparkSessions

2020-01-29 Thread yeikel valdes
d HDFS with given parameters. This function returns a pyspark dataframe and the SparkContext it used. With client's increasing demands, I need to merge data from multiple query. I tested using "union" function to merge the pyspark dataframes returned by different function calls direct

union two pyspark dataframes from different SparkSessions

2020-01-29 Thread Zong-han, Xie
"union" function to merge the pyspark dataframes returned by different function calls directly and it worked. It surprised me that a pyspark dataframe can actually union dataframes from different SparkSessions. I am using pyspark 2.3.1 and Python 3.5. I wonder if this is a good practice or I bette

Re: Loop through Dataframes

2019-10-06 Thread Holden Karau
there. On Sun, Oct 6, 2019 at 2:49 PM KhajaAsmath Mohammed wrote: > > Hi, > > What is the best approach to loop through 3 dataframes in scala based on > some keys instead of using collect. > > Thanks, > Asmath > -- Twitter: https://twitter.com/holdenkarau Books (Learning Sp

Loop through Dataframes

2019-10-06 Thread KhajaAsmath Mohammed
Hi, What is the best approach to loop through 3 dataframes in scala based on some keys instead of using collect. Thanks, Asmath

Cogrouping in Streaming Datasets/DataFrames is not supported ?

2019-08-23 Thread Kushagra Deep
Hi , I have a use case where I have to cogroup two streams using cogroup in streaming. However when I do so I get an exception that “Cogrouping in streaming is not supported in DataFrame/Dataset”. Please clarify. Regards , Kushagra Deep

Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-16 Thread Russell Spitzer
ts of the plan refer to static pieces of >> data ..."* Could you elaborate a bit more on what does this static >> piece of data refer to? Are you referring to the 10 records that had >> already arrived at T1 and are now sitting as old static data in the >> unbounded t

[Structured Streaming]: Are Spark Dataframes mutable in Structured Streaming?

2019-05-16 Thread Sheel Pancholi
tatic pieces of >> data ..."* Could you elaborate a bit more on what does this static >> piece of data refer to? Are you referring to the 10 records that had >> already arrived at T1 and are now sitting as old static data in the >> unbounded table? >> >> R

Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-16 Thread Sheel Pancholi
> ..." Could you elaborate a bit more on what does this static piece of > data refer to? Are you referring to the 10 records that had already arrived > at T1 and are now sitting as old static data in the unbounded table? > > Regards > Sheel > > On Thu, May 16, 201

Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-16 Thread Sheel Pancholi
AM Russell Spitzer wrote: > Dataframes describe the calculation to be done, but the underlying > implementation is an "Incremental Query". That is that the dataframe code > is executed repeatedly with Catalyst adjusting the final execution plan on > each run. Some parts of

Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-15 Thread Russell Spitzer
Dataframes describe the calculation to be done, but the underlying implementation is an "Incremental Query". That is that the dataframe code is executed repeatedly with Catalyst adjusting the final execution plan on each run. Some parts of the plan refer to static pieces of data, ot

Are Spark Dataframes mutable in Structured Streaming?

2019-05-15 Thread Sheel Pancholi
Hi Structured Streaming treats a stream as an unbounded table in the form of a DataFrame. Continuously flowing data from the stream keeps getting added to this DataFrame (which is the unbounded table), which warrants a change to the DataFrame and violates the very basic nature of a DataFrame

Re: Standardized Join Types for DataFrames

2019-02-22 Thread Jules Damji
am new to spark and want to start contributing to Apache spark to know more > about it. > I found this JIRA to have "Standardized Join Types for DataFrames", which I > feel could be a good starter task for me. I wanted to confirm if this is a > relevant/actionable task and i

Standardized Join Types for DataFrames

2019-02-22 Thread Pooja Agrawal
Hi, I am new to spark and want to start contributing to Apache spark to know more about it. I found this JIRA to have "Standardized Join Types for DataFrames", which I feel could be a good starter task for me. I wanted to confirm if this is a relevant/actionable task and if I can sta

RE: What are the alternatives to nested DataFrames?

2018-12-29 Thread email
: What are the alternatives to nested DataFrames? 2 options I can think of: 1- Can you perform a union of dfs returned by Elasticsearch queries. It would still be distributed but I don't know if you will run out of how many union operations you can perform at a time. 2- Can you use some

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Shahab Yunus
* em...@yeikel.com > *Cc:* Shahab Yunus ; user > *Subject:* Re: What are the alternatives to nested DataFrames? > > > > Could you join() the DFs on a common key? > > > > On Fri, Dec 28, 2018 at 18:35 wrote: > > Shabad , I am not sure what you are trying to say. Could you

RE: What are the alternatives to nested DataFrames?

2018-12-28 Thread email
iginal DF and returns a new dataframe including all the matching terms From: Andrew Melo Sent: Friday, December 28, 2018 8:48 PM To: em...@yeikel.com Cc: Shahab Yunus ; user Subject: Re: What are the alternatives to nested DataFrames? Could you join() the DFs on a common key?

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Andrew Melo
tString(0) > val qb = QueryBuilders.matchQuery("name", city).operator(Operator.AND) > print(qb.toString) > val dfs = sqlContext.esDF("cities/docs", qb.toString) // null pointer > dfs.show() >

RE: What are the alternatives to nested DataFrames?

2018-12-28 Thread email
uery("name", city).operator(Operator.AND) print(qb.toString) val dfs = sqlContext.esDF("cities/docs", qb.toString) // null pointer dfs.show() }) From: Shahab Yunus Sent: Friday, December 28, 2018 12:34 PM To: em...@yeikel.com Cc: user Sub

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Shahab Yunus
oes not support the nesting of > DataFrames , but what are the options? > > > > I have the following scenario : > > > > dataFrame1 = List of Cities > > > > dataFrame2 = Created after searching in ElasticSearch for each city in > dataFrame1 > > > &g

What are the alternatives to nested DataFrames?

2018-12-27 Thread email
Hi community , As shown in other answers online , Spark does not support the nesting of DataFrames , but what are the options? I have the following scenario : dataFrame1 = List of Cities dataFrame2 = Created after searching in ElasticSearch for each city in dataFrame1 I've
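A minimal sketch of the join-instead-of-nesting idea raised in the replies; the document source and the "city" join column are assumptions standing in for the Elasticsearch-backed DataFrame:

    # keep one flat DataFrame per level and relate them by a key instead of nesting DataFrames
    cities = spark.createDataFrame([("Lima",), ("Boston",)], ["city"])
    docs = spark.read.json("/data/city_docs")          # stand-in for the per-city search results

    result = cities.join(docs, on="city", how="left")  # one distributed join, no DataFrame-of-DataFrames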

Spark column combinations and combining multiple dataframes (pyspark)

2018-11-26 Thread Christopher Petrino
to a python array like the_dfs.append(df.select(cols).toDF(*cols).cache()) the_dfs[len(the_dfs)].count() The dataframes are finally combined using df_all = reduce(DataFrame.union, the_dfs).cache() df_all.count() THE CURRENT STATE: The proof of concept works on a smaller amount of data

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-02 Thread Nirav Patel
= key._1 and P2 = key._2") > } > > Regards, > Nirav > > > On Wed, Aug 1, 2018 at 4:18 PM, Koert Kuipers wrote: > >> this works for dataframes with spark 2.3 by changing a global setting, >> and will be configurable per write in 2.4 >> see: >>

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-02 Thread Nirav Patel
Nirav On Wed, Aug 1, 2018 at 4:18 PM, Koert Kuipers wrote: > this works for dataframes with spark 2.3 by changing a global setting, and > will be configurable per write in 2.4 > see: > https://issues.apache.org/jira/browse/SPARK-20236 > https://issues.apache.org/jira/browse/SPARK

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-01 Thread Koert Kuipers
this works for dataframes with spark 2.3 by changing a global setting, and will be configurable per write in 2.4 see: https://issues.apache.org/jira/browse/SPARK-20236 https://issues.apache.org/jira/browse/SPARK-24860 On Wed, Aug 1, 2018 at 3:11 PM, Nirav Patel wrote: > Hi Peay, > >
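A minimal sketch of the setting referenced above (SPARK-20236); the partition columns and output path are placeholders:

    # Spark 2.3+: overwrite only the partitions present in the incoming data
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (df.write
       .mode("overwrite")
       .partitionBy("P1", "P2")
       .option("partitionOverwriteMode", "dynamic")   # per-write form available from 2.4 (SPARK-24860)
       .parquet("/data/output"))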

Re: Saving dataframes with partitionBy: append partitions, overwrite within each

2018-08-01 Thread Nirav Patel
Hi Peay, Have you find better solution yet? I am having same issue. Following says it works with spark 2.1 onward but only when you use sqlContext and not Dataframe https://medium.com/@anuvrat/writing-into-dynamic-partitions-using-spark-2e2b818a007a Thanks, Nirav On Mon, Oct 2, 2017 at 4:37

Zstd codec for writing dataframes

2018-06-18 Thread Nikhil Goyal
Hi guys, I was wondering if there is a way to compress files using zstd. It seems zstd compression can be used for shuffle data, is there a way to use it for output data as well? Thanks Nikhil
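A sketch of what this would look like for Parquet output; whether the zstd codec is actually accepted depends on the Spark/Parquet/Hadoop versions in the build, so treat this as an assumption to verify rather than a guarantee:

    # per-write codec for Parquet output
    df.write.option("compression", "zstd").parquet("/data/out_zstd")

    # or set it globally for Parquet writes
    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")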

Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-28 Thread Jacek Laskowski
message from Kafka using Spark Structured Streaming(SSS) and explode >>> the data and flatten all data into single record using DataFrame joins >>> and >>> land into a relational database table(DB2). >>> >>> But we are getting the following error when we write

Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-15 Thread रविशंकर नायर
e >> the data and flatten all data into single record using DataFrame joins and >> land into a relational database table(DB2). >> >> But we are getting the following error when we write into db using JDBC. >> >> “org.apache.spark.sql.AnalysisException: Inner join between two stre

Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-13 Thread Jacek Laskowski
all data into single record using DataFrame joins and > land into a relational database table(DB2). > > But we are getting the following error when we write into db using JDBC. > > “org.apache.spark.sql.AnalysisException: Inner join between two streaming > DataFrames/Datasets is

Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-12 Thread ThomasThomas
Thanks for the quick response...I'm able to inner join the dataframes with regular spark session. The issue is only with the spark streaming session. BTW I'm using Spark 2.2.0 version... -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com
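For reference, inner joins between two streaming DataFrames were only added in Spark 2.3, which is why 2.2.0 raises this AnalysisException. A minimal 2.3+ sketch using the built-in rate source (schemas and sink are illustrative; in practice watermarks should be added to bound the join state):

    left  = spark.readStream.format("rate").load().selectExpr("value AS k", "timestamp AS t1")
    right = spark.readStream.format("rate").load().selectExpr("value AS k", "timestamp AS t2")

    joined = left.join(right, "k")          # inner stream-stream join
    query = joined.writeStream.format("console").outputMode("append").start()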

Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-12 Thread रविशंकर नायर
getting the following error when we write into db using JDBC. > > “org.apache.spark.sql.AnalysisException: Inner join between two streaming > DataFrames/Datasets is not supported;” > > Any help would be greatly appreciated. > > Thanks, > Thomas Thomas > Mastermind Solut

Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-12 Thread ThomasThomas
table(DB2). But we are getting the following error when we write into db using JDBC. “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;” Any help would be greatly appreciated. Thanks, Thomas Thomas, Mastermind Solutions LLC. -- Sent

Re: DataFrames :: Corrupted Data

2018-03-28 Thread Sergey Zhemzhitsky
I suppose that it's hardly possible that this issue is connected with the string encoding, because - "pr^?files.10056.10040" should be "profiles.10056.10040" and is defined as constant in the source code -

Re: DataFrames :: Corrupted Data

2018-03-28 Thread Jörn Franke
Encoding issue of the data? Eg spark uses utf-8 , but source encoding is different? > On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky wrote: > > Hello guys, > > I'm using Spark 2.2.0 and from time to time my job fails printing into > the log the following errors > >

DataFrames :: Corrupted Data

2018-03-28 Thread Sergey Zhemzhitsky
Hello guys, I'm using Spark 2.2.0 and from time to time my job fails printing into the log the following errors scala.MatchError: profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@ scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)

Re: strange behavior of joining dataframes

2018-03-23 Thread Shiyuan
.col("nL")>1) df = df.join(df_t.select("ID"),["ID"]) df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count", "cnt1") df = df.join(df_sw, ["ID","kk"]) On Tue, Mar 20, 2018 at 9:58 PM, Shiyuan <gshy

strange behavior of joining dataframes

2018-03-20 Thread Shiyuan
Hi Spark-users: I have a dataframe "df_t" which was generated from other dataframes by several transformations. And then I did something very simple, just counting the rows, that is the following code: (A) df_t_1 = df_t.groupby(["Id","key"]).count().withColumnR

Caching dataframes and overwrite

2017-11-21 Thread Michael Artz
I have been interested in finding out why I am getting strange behavior when running a certain spark job. The job will error out if I place an action (A .show(1) method) either right after caching the DataFrame or right before writing the dataframe back to hdfs. There is a very similar post to

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Shixiong(Ryan) Zhu
You are using the Spark Streaming Kafka package. The correct package name is "org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0". On Mon, Nov 20, 2017 at 4:15 PM, salemi wrote: > Yes, we are using --packages > $SPARK_HOME/bin/spark-submit --packages >

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread salemi
Yes, we are using --packages $SPARK_HOME/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 --py-files shell.py -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Holden Karau
org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 ... On Mon, Nov 20, 2017 at 3:07 PM, salemi <alireza.sal...@udo.edu> wrote: > Hi All, > > we are trying to use DataFrames approach with Kafka 0.10 and PySpark 2.2.0. > We followed the instruction on the wiki > https://spark.a

PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread salemi
Hi All, we are trying to use DataFrames approach with Kafka 0.10 and PySpark 2.2.0. We followed the instruction on the wiki https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html. We coded something similar to the code below using Python: df = spark \ .read \ .format

Union of streaming dataframes

2017-11-17 Thread Lalwani, Jayesh
Is union of 2 Structured streaming dataframes from different sources supported in 2.2? We have done a union of 2 streaming dataframes that are from the same source. I wanted to know if multiple streams are supported or going to be supported in the future
