Re: [pyspark 2.4] broadcasting DataFrame throws error

2020-09-21 Thread Rishi Shah
…ion/CacheManager.scala#L219 > > And can you please tell what issue was solved in Spark 3, which you are > referring to. > > Regards > Amit > > > On Saturday, September 19, 2020, Rishi Shah > wrote: > >> Thanks Amit. I have tried increasing driver memory, also…

Re: [pyspark 2.4] broadcasting DataFrame throws error

2020-09-18 Thread Rishi Shah
if it works. > > Regards, > Amit > > > On Thursday, September 17, 2020, Rishi Shah > wrote: > >> Hello All, >> >> Hope this email finds you well. I have a dataframe of size 8TB (parquet >> snappy compressed), however I group it by a column and get a much
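The pattern under discussion, as a minimal sketch with hypothetical dataframe and column names: aggregate the large table down first, then broadcast only the small result explicitly.

    import pyspark.sql.functions as F
    from pyspark.sql.functions import broadcast

    # Reduce the 8TB table to a small lookup first (names are illustrative)
    small = big_df.groupBy("group_col").agg(F.count("*").alias("cnt"))

    # Broadcast only the reduced result; the broadcast table is collected to the
    # driver, so driver memory and spark.sql.autoBroadcastJoinThreshold both matter
    joined = other_df.join(broadcast(small), "group_col")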

[pyspark 2.4] broadcasting DataFrame throws error

2020-09-16 Thread Rishi Shah
…? -- Regards, Rishi Shah

Re: [pyspark 2.3+] read/write huge data with smaller block size (128MB per block)

2020-06-19 Thread Rishi Shah
…here is different still -- you have serious data skew > because you partitioned by date, and I suppose some dates have lots of > data, some have almost none. > > > On Fri, Jun 19, 2020 at 12:17 AM Rishi Shah > wrote: > >> Hi All, >> >> I have about 10TB of…
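One common remedy for the date skew described above, as a sketch (column names and partition counts are illustrative, not tuned values): add a random component to the repartitioning so heavy dates are spread across several tasks before the partitioned write.

    import pyspark.sql.functions as F

    # Spread each date's rows over several shuffle partitions so skewed dates
    # don't end up in one huge task/file
    df = df.repartition(2000, "date_col", (F.rand() * 8).cast("int"))

    df.write.partitionBy("date_col").parquet("/out/table")   # placeholder path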

[pyspark 2.3+] Add scala library to pyspark app and use to derive columns

2020-06-06 Thread Rishi Shah
…Regards, Rishi Shah
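One way to call Scala code from a PySpark app, assuming the Scala logic is exposed as a Java UDF class; the jar, class, and column names below are hypothetical.

    import pyspark.sql.functions as F
    from pyspark.sql.types import DoubleType

    # Ship the jar with:  spark-submit --jars my_scala_udfs.jar app.py
    # The class is assumed to implement org.apache.spark.sql.api.java.UDF1
    spark.udf.registerJavaFunction("my_scala_udf", "com.example.MyScalaUDF", DoubleType())

    df = df.withColumn("derived_col", F.expr("my_scala_udf(input_col)"))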

Re: [PySpark] Tagging descriptions

2020-06-04 Thread Rishi Shah
On Thu, May 14, 2020, 21:14 Amol Umbarkar wrote: > >> Check out sparkNLP for tokenization. I am not sure about Solr or Elasticsearch >> though >> >> On Thu, May 14, 2020 at 9:02 PM Rishi Shah >> wrote: >> >>> This is great, thank you Zhang & Amol…

Re: [PySpark 2.3+] Reading parquet entire path vs a set of file paths

2020-06-03 Thread Rishi Shah
Hi All, Just following up on below to see if anyone has any suggestions. Appreciate your help in advance. Thanks, Rishi On Mon, Jun 1, 2020 at 9:33 AM Rishi Shah wrote: > Hi All, > > I use the following to read a set of parquet file paths when files are > scattered across many…

[PySpark 2.3+] Reading parquet entire path vs a set of file paths

2020-06-01 Thread Rishi Shah
…is that correct? If I put all these files in a single path and read like below, it works faster: path = 'consolidated_path' df = spark.read.parquet(path) Is my observation correct? If so, is there a way to optimize reads from multiple/specific paths? -- Regards, Rishi Shah
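For reference, both read patterns from the question as a sketch (paths are placeholders); reading many individual paths mostly adds file-listing overhead, and a basePath option keeps partition discovery working when only a subset of directories is read.

    # Read an explicit list of paths
    paths = ["/data/events/day=2020-05-01", "/data/events/day=2020-05-02"]
    df_subset = spark.read.option("basePath", "/data/events").parquet(*paths)

    # Read a single consolidated root path
    df_all = spark.read.parquet("/data/events")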

[pyspark 2.3+] Dedupe records

2020-05-29 Thread Rishi Shah
…input is highly appreciated! -- Regards, Rishi Shah

Re: [PySpark] Tagging descriptions

2020-05-14 Thread Rishi Shah
…data. You could use a hash of > some kind to join back. Though I would go for this approach only if the > chances of similarity in text are very high (it could be in your case for > being transactional data). > > Not the full answer to your question but hope this helps you brainstorm…

Re: [PySpark] Tagging descriptions

2020-05-12 Thread Rishi Shah
…> May I get some requirement details? > > Such as: > 1. The row count and one row data size > 2. The avg length of text to be parsed by RegEx > 3. The sample format of text to be parsed > 4. The sample of current RegEx > > -- > Cheers, > -z > > On Mon, 11 May…

[PySpark] Tagging descriptions

2020-05-11 Thread Rishi Shah
Hi All, I have a tagging problem at hand where we currently use regular expressions to tag records. Is there a recommended way to distribute & tag? Data is about 10TB large. -- Regards, Rishi Shah
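A sketch of distributed regex tagging with built-in column functions (patterns and tag names are illustrative); staying with rlike/regexp_extract keeps the work inside the JVM and avoids Python UDF overhead on a 10TB scan.

    import pyspark.sql.functions as F

    df = df.withColumn(
        "tag",
        F.when(F.col("description").rlike(r"(?i)refund|chargeback"), "refund")
         .when(F.col("description").rlike(r"(?i)subscription|renewal"), "subscription")
         .otherwise("other"))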

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah
have to repeat > indefinitely. See this blog post for more details! > > https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html > > Best, > Burak > > On Fri, May 1, 2020 at 2:55 PM Rishi Shah > wrote: > >> Hi All, >> >>

[spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah
…with the checkpoint location feature to achieve fault tolerance in a batch processing application? -- Regards, Rishi Shah
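The blog post referenced in the reply above describes running a Structured Streaming query with a one-shot trigger so the checkpoint tracks what has already been processed; a minimal sketch with placeholder paths and schema:

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField("value", StringType())])   # placeholder schema

    incoming = (spark.readStream
                     .schema(schema)
                     .json("/landing/events"))                   # placeholder input path

    (incoming.writeStream
             .trigger(once=True)                           # runs like a batch job, then stops
             .option("checkpointLocation", "/chk/events")  # remembers processed files
             .format("parquet")
             .option("path", "/curated/events")
             .start()
             .awaitTermination())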

Re: [pyspark 2.4+] BucketBy SortBy doesn't retain sort order

2020-03-03 Thread Rishi Shah
Hi All, Just checking in to see if anyone has any advice on this. Thanks, Rishi On Mon, Mar 2, 2020 at 9:21 PM Rishi Shah wrote: > Hi All, > > I have 2 large tables (~1TB); I used the following to save both > tables. Then when I try to join both tables with join_column, it…
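For context, a sketch of the write pattern being discussed (bucket counts and names are illustrative); note that if each bucket ends up split across several files, Spark may still add a sort before the join even though the data was written with sortBy.

    for name, d in [("t1", df1), ("t2", df2)]:
        (d.write
          .bucketBy(512, "join_column")
          .sortBy("join_column")
          .format("parquet")
          .saveAsTable(name))      # bucketing requires saveAsTable, not a path-based save

    joined = spark.table("t1").join(spark.table("t2"), "join_column")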

[pyspark 2.4+] BucketBy SortBy doesn't retain sort order

2020-03-02 Thread Rishi Shah
…saveAsTable(tablename) -- Regards, Rishi Shah

Re: High level explanation of dropDuplicates

2020-01-11 Thread Rishi Shah
…dataset or what are the alternatives. From the > explain output I can see the two Exchanges, so it may not be the best > approach? > > -- Regards, Rishi Shah

Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-06 Thread Rishi Shah
…need to salt the keys for better partitioning. > Are you using a coalesce or any other function which brings the data to fewer > nodes? Window functions also incur shuffling, which could be an issue. > > On Mon, 6 Jan 2020 at 9:49 AM, Rishi Shah > wrote: > >> Thanks Hemant, underlying…
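A minimal sketch of the key-salting idea mentioned in this reply (salt width and column names are illustrative):

    import pyspark.sql.functions as F

    SALT_BUCKETS = 16   # how many sub-partitions each hot key is spread across

    salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

    # Shuffle on (key, salt) so one skewed key no longer lands in a single task
    evened = salted.repartition("key", "salt")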

Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-05 Thread Rishi Shah
…memory in individual executors. > The job is getting completed maybe because when tasks are re-scheduled they > go through. > > Thanks. > > On Mon, 6 Jan 2020 at 5:47 AM, Rishi Shah > wrote: > >> Hello All, >> >> One of my jobs keeps getting into this…

[pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-05 Thread Rishi Shah
Hello All, One of my jobs keeps getting into a situation where 100s of tasks keep failing with the error below, but the job eventually completes. org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory Could someone advise? -- Regards, Rishi Shah

Re: [Pyspark 2.3+] Timeseries with Spark

2019-12-29 Thread Rishi Shah
sure of the best way/use cases to use these libraries for or if there's any other preferred way of tackling time series problems at scale. Thanks, -Shraddha On Sun, Jun 16, 2019 at 9:17 AM Rishi Shah wrote: > Thanks Jorn. I am interested in timeseries forecasting for now but in > general I

[pyspark 2.3+] broadcast timeout

2019-12-09 Thread Rishi Shah
period or shutting off broadcast completely by setting the auto broadcast property to -1? -- Regards, Rishi Shah
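The two session settings referred to in the question, as a sketch (the timeout value is illustrative):

    # Give the broadcast exchange more time (seconds; default is 300)
    spark.conf.set("spark.sql.broadcastTimeout", "1200")

    # Or disable automatic broadcast joins entirely
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")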

[pyspark 2.4.0] write with overwrite mode fails

2019-12-07 Thread Rishi Shah
…previous run was abruptly interrupted and the partitioned directory only has a _started flag file & no _SUCCESS or _committed. In this case, the second run doesn't overwrite, causing the partition to have duplicated files. Could someone please help? -- Regards, Rishi Shah

[pyspark 2.4] maxrecordsperfile option

2019-11-23 Thread Rishi Shah
…)? -- Regards, Rishi Shah
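For reference, the option as it is typically applied on the writer (cap value and paths are illustrative); it limits rows per output file but does not merge small partitions, so it is often combined with repartition or coalesce.

    (df.write
       .option("maxRecordsPerFile", 1000000)   # cap rows written per output file
       .partitionBy("date_col")
       .parquet("/out/table"))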

Re: [pyspark 2.3.0] Task was denied committing errors

2019-11-10 Thread Rishi Shah
Hi Team, I could really use your insight here, any help is appreciated! Thanks, Rishi On Wed, Nov 6, 2019 at 8:27 PM Rishi Shah wrote: > Any suggestions? > > On Wed, Nov 6, 2019 at 7:30 AM Rishi Shah > wrote: > >> Hi All, >> >> I have two relatively

Re: [pyspark 2.3.0] Task was denied committing errors

2019-11-06 Thread Rishi Shah
Any suggestions? On Wed, Nov 6, 2019 at 7:30 AM Rishi Shah wrote: > Hi All, > > I have two relatively big tables and join on them keeps throwing > TaskCommitErrors, eventually job succeeds but I was wondering what these > errors are and if there's any solution? > > -- >

[pyspark 2.3.0] Task was denied committing errors

2019-11-06 Thread Rishi Shah
Hi All, I have two relatively big tables and a join on them keeps throwing TaskCommitErrors; the job eventually succeeds, but I was wondering what these errors are and whether there's any solution? -- Regards, Rishi Shah

Re: [pyspark 2.4.3] nested windows function performance

2019-10-21 Thread Rishi Shah
Hi All, Any suggestions? Thanks, -Rishi On Sun, Oct 20, 2019 at 12:56 AM Rishi Shah wrote: > Hi All, > > I have a use case where I need to perform nested windowing functions on a > data frame to get final set of columns. Example: > > w1 = Window.partitionBy('col1') > df

[pyspark 2.4.3] nested windows function performance

2019-10-19 Thread Rishi Shah
…idea in pyspark? I wanted to avoid using multiple filters + joins to get to the final state, as joins can create heavy shuffles. Any suggestions would be appreciated! -- Regards, Rishi Shah
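A sketch of stacking window functions instead of filter + join chains (window keys and aggregates are illustrative); each distinct partitionBy/orderBy spec costs its own shuffle and sort, so reusing specs across derived columns helps.

    import pyspark.sql.functions as F
    from pyspark.sql import Window

    w1 = Window.partitionBy("col1")
    w2 = Window.partitionBy("col1", "col2")

    df = (df
          .withColumn("col1_max", F.max("value").over(w1))
          .withColumn("col1_col2_cnt", F.count("*").over(w2)))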

[pyspark 2.4.3] small input csv ~3.4GB gets 40K tasks created

2019-08-29 Thread Rishi Shah
Hi All, I am scratching my head against this weird behavior, where df (read from .csv) of size ~3.4GB gets cross joined with itself and creates 50K tasks! How to correlate input size with number of tasks in this case? -- Regards, Rishi Shah

[python 2.4.3] correlation matrix

2019-08-28 Thread Rishi Shah
Hi All, What is the best way to calculate correlation matrix? -- Regards, Rishi Shah
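One way to get a Pearson correlation matrix with pyspark.ml, as a sketch assuming a list of numeric columns:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    num_cols = ["a", "b", "c"]   # placeholder numeric columns
    vec = (VectorAssembler(inputCols=num_cols, outputCol="features")
           .transform(df)
           .select("features"))

    matrix = Correlation.corr(vec, "features", "pearson").head()[0]   # DenseMatrix
    print(matrix.toArray())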

Re: [Pyspark 2.4] not able to partition the data frame by dates

2019-07-31 Thread Rishi Shah
on? And what is > the environment? > > I know this issue occurs intermittently over large writes in S3 and has to > do with S3 eventual consistency issues. Just restarting the job sometimes > helps. > > > Regards, > Gourav Sengupta > > On Thu, Aug 1, 2019 at 3:55 AM Rishi Shah &

[Pyspark 2.4] not able to partition the data frame by dates

2019-07-31 Thread Rishi Shah
…').parquet(PATH) I did notice that a couple of tasks failed, and probably that's why it tried spinning up new ones which write to the same .staging directory? -- Regards, Rishi Shah

[Pyspark 2.4] Large number of row groups in parquet files created using spark

2019-07-24 Thread Rishi Shah
application using spark-submit? df = spark.read.parquet(INPUT_PATH) df.coalesce(1).write.parquet(OUT_PATH) I did try --conf spark.parquet.block.size & spark.dfs.blocksize, but that makes no difference. -- Regards, Rishi Shah
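Row-group size is the Parquet writer's parquet.block.size, which is read from the Hadoop configuration rather than from Spark SQL confs, so one sketch (untested value, paths as in the question) is to pass it through spark.hadoop.*:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.hadoop.parquet.block.size", 128 * 1024 * 1024)  # 128MB row groups
             .getOrCreate())

    df = spark.read.parquet(INPUT_PATH)
    df.coalesce(1).write.parquet(OUT_PATH)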

[pyspark 2.4.0] write with partitionBy fails due to file already exits

2019-07-01 Thread Rishi Shah
…wrong here… -- Regards, Rishi Shah

Re: [pyspark 2.3+] CountDistinct

2019-06-29 Thread Rishi Shah
are performing any operations on the DF before the countDistinct? > > I recall there was a bug when I did countDistinct(PythonUDF(x)) in the > same query which was resolved in one of the minor versions in 2.3.x > > On Sat, Jun 29, 2019, 10:32 Rishi Shah wrote: > >> Hi All,

Re: [pyspark 2.3+] CountDistinct

2019-06-28 Thread Rishi Shah
Hi All, Just wanted to check in to see if anyone has any insight about this behavior. Any pointers would help. Thanks, Rishi On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah wrote: > Hi All, > > Recently we noticed that countDistinct on a larger dataframe doesn't > always return the same…

Re: [Pyspark 2.3+] Timeseries with Spark

2019-06-16 Thread Rishi Shah
you > describe more what you mean by time series use case, ie what is the input, > what do you like to do with the input and what is the output? > > > Am 14.06.2019 um 06:01 schrieb Rishi Shah : > > > > Hi All, > > > > I have a time series use case which I would

[pyspark 2.3+] CountDistinct

2019-06-14 Thread Rishi Shah
Hi All, Recently we noticed that countDistinct on a larger dataframe doesn't always return the same value. Any idea? If this is the case then what is the difference between countDistinct & approx_count_distinct? -- Regards, Rishi Shah

[Pyspark 2.3+] Timeseries with Spark

2019-06-13 Thread Rishi Shah
Hi All, I have a time series use case which I would like to implement in Spark... What would be the best way to do so? Any built in libraries? -- Regards, Rishi Shah

[pyspark 2.3+] count distinct returns different value every time it is run on the same dataset

2019-06-11 Thread Rishi Shah
Hi All, countDistinct on a dataframe returns different results every time it is run. I would expect that with approxCountDistinct, but even for countDistinct()? Is there a way to get an accurate, deterministic count using pyspark? -- Regards, Rishi Shah
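For reference, the two aggregates side by side; countDistinct itself is exact and deterministic, so differing results usually point to an input that changes between runs or a non-deterministic upstream transformation, which pinning the input with persist() makes visible. Column names are illustrative.

    import pyspark.sql.functions as F

    df = df.persist()   # pin the input so repeated aggregations see identical data

    exact = df.agg(F.countDistinct("user_id").alias("n")).collect()[0]["n"]
    approx = df.agg(F.approx_count_distinct("user_id", rsd=0.01).alias("n")).collect()[0]["n"]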

Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-10 Thread Rishi Shah
…bitmap is the best choice. > On 2019/6/5 6:49 PM, Rishi Shah wrote: > > Hi All, > > Is there a best practice around calculating daily, weekly, monthly, > quarterly, yearly active users? > > One approach is to create a window of daily bitmaps and aggregate them based > on the period…
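A simpler, if heavier, alternative to the bitmap approach, sketched with illustrative column names: group by the period and count distinct users per period.

    import pyspark.sql.functions as F

    daily = (df.groupBy(F.to_date("event_ts").alias("day"))
               .agg(F.countDistinct("user_id").alias("dau")))

    monthly = (df.groupBy(F.trunc(F.to_date("event_ts"), "month").alias("month"))
                 .agg(F.approx_count_distinct("user_id").alias("mau")))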

Re: High level explanation of dropDuplicates

2019-06-09 Thread Rishi Shah
…sql/Dataset.scala#L2326 >> >> Could someone please explain? >> >> Thank you >> >> -- Regards, Rishi Shah

[pyspark 2.3+] Querying non-partitioned @TB data table is too slow

2019-06-09 Thread Rishi Shah
…predicate pushdown and filtering without scanning the entire dataset? -- Regards, Rishi Shah

[Pyspark 2.4] Best way to define activity within different time window

2019-06-05 Thread Rishi Shah
…Regards, Rishi Shah

Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-31 Thread Rishi Shah
…periodically running a compaction job. > > If you’re simply appending daily snapshots, then you could just consider > using date partitions instead? > > *From: *Rishi Shah > *Date: *Thursday, May 30, 2019 at 10:43 PM > *To: *"user @spark" > *Subject: *…

[pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-30 Thread Rishi Shah
…you need to rewrite the entire table every time; can someone help advise? -- Regards, Rishi Shah

Re: [pyspark 2.3+] how to dynamically determine DataFrame partitions while writing

2019-05-22 Thread Rishi Shah
Hi All, Any idea about this? Thanks, Rishi On Tue, May 21, 2019 at 11:29 PM Rishi Shah wrote: > Hi All, > > What is the best way to determine partitions of a dataframe dynamically > before writing to disk? > > 1) statically determine based on data and use coalesce or r

[pyspark 2.3+] repartition followed by window function

2019-05-22 Thread Rishi Shah
…on keys)? -- Regards, Rishi Shah

[pyspark 2.3+] how to dynamically determine DataFrame partitions while writing

2019-05-21 Thread Rishi Shah
…however, how do I determine the total count without risking computing the dataframe twice (if the dataframe is not cached and count() is used)? -- Regards, Rishi Shah
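One common sketch for sizing the output without computing the dataframe twice: persist first, so the count that drives the coalesce also materializes the cache (the target rows-per-file below is illustrative).

    df = df.persist()

    rows = df.count()                                # single materialization
    files = max(1, int(rows / 5000000))              # aim for ~5M rows per output file
    df.coalesce(files).write.parquet("/out/table")   # placeholder path

    df.unpersist()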

[pyspark 2.3] count followed by write on dataframe

2019-05-20 Thread Rishi Shah
…dataframe)… Could anyone please help confirm? -- Regards, Rishi Shah

Re: Spark job gets hung on cloudera cluster

2019-05-17 Thread Rishi Shah
…particular data block is unavailable due to node failures. > > Can you check if your YARN service can communicate with Name node service? > > Akshay Bhardwaj > +91-97111-33849 > > > On Thu, May 16, 2019 at 4:27 PM Rishi Shah > wrote: > >> on yarn

Re: Spark job gets hung on cloudera cluster

2019-05-16 Thread Rishi Shah
on yarn On Thu, May 16, 2019 at 1:36 AM Akshay Bhardwaj < akshay.bhardwaj1...@gmail.com> wrote: > Hi Rishi, > > Are you running spark on YARN or spark's master-slave cluster? > > Akshay Bhardwaj > +91-97111-33849 > > > On Thu, May 16, 2019 at 7:15 AM Rishi S

Re: Databricks - number of executors, shuffle.partitions etc

2019-05-16 Thread Rishi Shah
…create a new cluster or add > them later. Go to the cluster page, open one cluster, expand the additional config > section and add your param there as a key value pair separated by a space. > > On Thu, 16 May 2019 at 11:46 am, Rishi Shah > wrote: > >> Hi All, >> >> Any idea?…

Re: Databricks - number of executors, shuffle.partitions etc

2019-05-15 Thread Rishi Shah
Hi All, Any idea? Thanks, -Rishi On Tue, May 14, 2019 at 11:52 PM Rishi Shah wrote: > Hi All, > > How can we set spark conf parameter in databricks notebook? My cluster > doesn't take into account any spark.conf.set properties... it creates 8 > worker nodes (dat executors) but…

Re: Spark job gets hung on cloudera cluster

2019-05-15 Thread Rishi Shah
Anyone, please? On Tue, May 14, 2019 at 11:51 PM Rishi Shah wrote: > Hi All, > > At times when there's a data node failure, running spark job doesn't fail > - it gets stuck and doesn't return. Any setting can help here? I would > ideally like to get the job terminated or ex…

Databricks - number of executors, shuffle.partitions etc

2019-05-14 Thread Rishi Shah
Hi All, How can we set spark conf parameter in databricks notebook? My cluster doesn't take into account any spark.conf.set properties... it creates 8 worker nodes (dat executors) but doesn't honor the supplied conf parameters. Any idea? -- Regards, Rishi Shah
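As the reply above notes, only session-scoped SQL confs can be set from the notebook; cluster sizing has to go into the cluster's Spark config. A sketch:

    # Session-scoped SQL settings work from the notebook:
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    # Executor count/memory are cluster-level and must be set on the cluster page
    # ("Spark Config", space-separated key/value pairs), e.g.:
    #   spark.executor.memory 16g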

Spark job gets hung on cloudera cluster

2019-05-14 Thread Rishi Shah
Hi All, At times when there's a data node failure, running spark job doesn't fail - it gets stuck and doesn't return. Any setting can help here? I would ideally like to get the job terminated or executors running on those data nodes fail... -- Regards, Rishi Shah

[pyspark 2.3] drop_duplicates, keep first record based on sorted records

2019-05-13 Thread Rishi Shah
Hi All, Is there a better way to drop duplicates and keep the first record based on a sorted column? Simple sorting on the dataframe followed by dropping duplicates is quite slow! -- Regards, Rishi Shah
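A sketch of the usual window-based alternative (key and ordering columns are illustrative): rank rows within each key and keep the first, which avoids a full global sort.

    import pyspark.sql.functions as F
    from pyspark.sql import Window

    w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())

    deduped = (df.withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") == 1)
                 .drop("rn"))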

[Pyspark 2.3] Logical operators (and/or) in pyspark

2019-05-13 Thread Rishi Shah
…'match_flag', (col('list_names').isNull()) | (contains(col('name'), col('list_names')))) Here, where list_names is null, it starts to throw an error: NoneType is not iterable. Any idea? -- Regards, Rishi Shah
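Spark does not short-circuit the | between the two column expressions, so the UDF still receives None for null rows; a sketch of guarding inside the (hypothetical) UDF:

    import pyspark.sql.functions as F
    from pyspark.sql.types import BooleanType

    @F.udf(BooleanType())
    def contains(name, list_names):
        if name is None or list_names is None:   # NULL arrives as None in a Python UDF
            return False
        return name in list_names

    df = df.withColumn(
        "match_flag",
        F.col("list_names").isNull() | contains(F.col("name"), F.col("list_names")))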

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-04 Thread Rishi Shah
> > On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah > wrote: > >> modified the subject & would like to clarify that I am looking to create >> an anaconda parcel with pyarrow and other libraries, so that I can >> distribute it on the cloudera cluster.. >>

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-04-29 Thread Rishi Shah
modified the subject & would like to clarify that I am looking to create an anaconda parcel with pyarrow and other libraries, so that I can distribute it on the cloudera cluster.. On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah wrote: > Hi All, > > I have been trying to figure out a

Anaconda installation with Pyspark on cloudera managed server

2019-04-29 Thread Rishi Shah
... tarred the directory, but this directory doesn't include all the packages to form a proper parcel for distribution. Any help is much appreciated! -- Regards, Rishi Shah

[pyspark] Use output of one aggregated function for another aggregated function within the same groupby

2019-04-24 Thread Rishi Shah
…col('max_date'))) Please note 'max_date' is the result of the aggregate function max inside the groupBy agg. I can definitely use multiple groupBys to achieve this, but is there a better way? Better performance-wise, maybe? Appreciate your help! -- Regards, Rishi Shah
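One sketch that avoids a second groupBy: compute the group maximum with a window first, so it is a plain column by the time the aggregation runs (grouping key and measure names are illustrative).

    import pyspark.sql.functions as F
    from pyspark.sql import Window

    w = Window.partitionBy("account_id")

    result = (df
              .withColumn("max_date", F.max("date").over(w))
              .groupBy("account_id")
              .agg(F.max("date").alias("max_date"),
                   F.sum(F.when(F.col("date") == F.col("max_date"), F.col("amount"))
                          .otherwise(F.lit(0))).alias("amount_on_max_date")))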

RDD vs Dataframe & when to persist

2019-04-24 Thread Rishi Shah
Hello All, I run into situations where I ask myself whether I should write a mapPartitions function on an RDD or use the dataframe approach all the way (withColumn + groupBy). I am using PySpark 2.3 (Python 2.7). I understand we should be utilizing dataframes as much as possible, but at times it feels like RDD…

Use derived column for other derived column in the same statement

2019-04-21 Thread Rishi Shah
Hello All, How can we use a derived column (derived1) to derive another column in the same dataframe operation statement? Something like: df = df.withColumn('derived1', lit('something')).withColumn('derived2', col('derived1') == 'something') -- Regards, Rishi Shah
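The chaining shown above works as written, because each withColumn returns a new DataFrame in which derived1 already exists; a runnable sketch:

    from pyspark.sql.functions import col, lit

    df = (df
          .withColumn("derived1", lit("something"))
          .withColumn("derived2", col("derived1") == "something"))   # sees derived1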