Is there a way to embed the SparkHistoryServer in my existing service?

2021-06-16 Thread apacheyi
Hi all, We have reached the limit of SparkHistoryServer and need to figure out a way to scale it out and also change some behaviors. I understand that one way is to implement the ApplicationHistoryProvider interface to customize the behavior. I wonder if there is a way to run SparkHistoryServer

Small file problem

2021-06-16 Thread Sachit Murarka
Hello Spark Users, we are receiving too many small files, about 3 million. Reading them with spark.read alone takes a long time and the job does not proceed further. Is there any way to speed this up and proceed? Regards, Sachit Murarka

Re: Does Rollups work with spark structured streaming with state.

2021-06-16 Thread Mich Talebzadeh
Hi, just to clarify: are we talking about *rollup* as a subset of a cube that computes hierarchical subtotals from left to right?
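
To make the definition in the message above concrete, here is a minimal sketch in plain Python (not Spark, and not streaming) of what a rollup over two columns computes. The function name `rollup`, the sample columns, and the use of `None` to mark a rolled-up column are assumptions made for illustration only. For keys (k1, k2) it aggregates over the grouping sets (k1, k2), (k1) and (), which is the left-to-right subset of the grouping sets a full cube would produce.

```python
from collections import defaultdict


def rollup(rows, keys, value):
    """Sum `value` over the grouping sets (k1..kn), (k1..kn-1), ..., ().

    A rolled-up (aggregated-away) column is marked with None in the
    group tuple. This is an illustrative sketch, not Spark's API.
    """
    totals = defaultdict(int)
    for row in rows:
        # depth = number of leading key columns kept in this grouping set
        for depth in range(len(keys), -1, -1):
            group = tuple(
                row[k] if i < depth else None for i, k in enumerate(keys)
            )
            totals[group] += row[value]
    return dict(totals)


# Hypothetical sample data
sales = [
    {"country": "US", "city": "NYC", "amount": 10},
    {"country": "US", "city": "SF", "amount": 5},
    {"country": "IN", "city": "Pune", "amount": 7},
]
subtotals = rollup(sales, ["country", "city"], "amount")
```

Here `subtotals` holds per-city rows, per-country subtotals such as `("US", None)`, and a grand total under `(None, None)`. Unlike a cube, there is no group keyed by city alone.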

Fwd: Does Rollups work with spark structured streaming with state.

2021-06-16 Thread Amit Joshi
I would appreciate it if someone could give some pointers on the question below. ---------- Forwarded message --------- From: Amit Joshi Date: Tue, Jun 15, 2021 at 12:19 PM Subject: [Spark]Does Rollups work with spark structured streaming with state. To: spark-user Hi Spark-Users, Hope you are all

Re: Spark PROCESS_LOCAL vs RACK_LOCAL, stage not scheduling tasks

2021-06-16 Thread Zilvinas Saltys
Please ignore this lengthy email. It seems the issue was caused by the stage evicting some of the cached memory, which made the 22 tasks slow. Is there a way to see in the logs when evictions happen? I've looked at spark executor metrics and it

Re: Moving millions of file using spark

2021-06-16 Thread Molotch
Definitely not a Spark task. Moving files within the same filesystem is merely a linking exercise; you don't have to actually move any data. Write a shell script that creates hard links in the new location and, once you're satisfied, remove the old links. Profit.
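
The reply above suggests a shell script. The sketch below shows the same hard-link approach in Python instead, under stated assumptions: the source and destination are on the same local filesystem, the destination subdirectory is taken from the first capture group of a regular expression matched against the file name, and the function names `relink_by_pattern` and `remove_originals` are hypothetical. Step one creates the new links without touching the originals; step two removes an original only after confirming that the new link points at the same file.

```python
import os
import re
from pathlib import Path


def relink_by_pattern(src_dir, dest_root, pattern):
    """Hard-link files from src_dir into dest_root/<group>/<name>.

    <group> is the first capture group of `pattern` matched against the
    file name. No data is copied. Returns the number of links created.
    """
    regex = re.compile(pattern)
    linked = 0
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            match = regex.match(entry.name)
            if match is None:
                continue  # name does not follow the pattern; leave it alone
            target_dir = Path(dest_root) / match.group(1)
            target_dir.mkdir(parents=True, exist_ok=True)
            os.link(entry.path, target_dir / entry.name)
            linked += 1
    return linked


def remove_originals(src_dir, dest_root, pattern):
    """Remove source names whose hard-linked twin exists under dest_root."""
    regex = re.compile(pattern)
    removed = 0
    with os.scandir(src_dir) as it:
        entries = list(it)  # snapshot the listing before deleting anything
    for entry in entries:
        if not entry.is_file(follow_symlinks=False):
            continue
        match = regex.match(entry.name)
        if match is None:
            continue
        target = Path(dest_root) / match.group(1) / entry.name
        # Only unlink when the target is verifiably the same file
        if target.exists() and os.path.samefile(entry.path, target):
            os.unlink(entry.path)
            removed += 1
    return removed
```

With a hypothetical pattern such as `r"(\d{4}-\d{2})-"`, a file named `2021-06-a.log` would end up under `dest_root/2021-06/`. `os.link` raises an error across filesystem boundaries, and object stores such as S3 have no hard links at all, so this only applies to the same-filesystem case the reply describes.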

Moving millions of file using spark

2021-06-16 Thread rajat kumar
Hello, I know this might not be a valid use case for Spark, but I have millions of files in a single folder. The file names follow a pattern, and based on that pattern I want to move them to different directories. Can you please suggest what can be done? Thanks, rajat

Spark PROCESS_LOCAL vs RACK_LOCAL, stage not scheduling tasks

2021-06-16 Thread Zilvinas Saltys
Hi, I'm running Spark 3.0.1 on AWS. Dynamic allocation is disabled. I'm caching a large dataset 100% in memory. Before caching it I coalesce the dataset to 1792 partitions. There are 112 executors and 896 cores on the cluster. The next stage reads those 1792 partitions as input. The query

Re: Why does sparkml random forest classifier not support maxBins < number of total categorical values?

2021-06-16 Thread Sean Owen
I think it's because otherwise you would not be able to consider, at least, K-1 splits among K features, and you want to be able to do that. There may be more technical reasons in the code that this is strictly enforced, but it seems like a decent idea. Agree, more than K doesn't seem to help,

Why does sparkml random forest classifier not support maxBins < number of total categorical values?

2021-06-16 Thread Reed Villanueva
Why does sparkml's random forest classifier not support maxBins (M) < (K) number of total categorical values? My

Re: What happens if a random forest max bins is set too high?

2021-06-16 Thread Reed Villanueva
I *think* I solved the issue. Will update with details after further testing / inspection. On Mon, Jun 14, 2021 at 8:50 PM Reed Villanueva wrote: > What happens if a random forest "max bins" hyperparameter is set too high? > > When training a sparkml random forest ( >