Re: can't download 2.4.1 sourcecode

2019-04-22 Thread 1101300123
I got it from GitHub and am building it now, but I hope someone can fix the website so more people can use it. | 1101300123 | Email: hdxg1101300...@163.com | Signature customized by NetEase Mail Master. On 04/23/2019 11:56, Andrew Melo wrote: On Mon, Apr 22, 2019 at 10:54 PM yutaochina wrote: > >

Re: can't download 2.4.1 sourcecode

2019-04-22 Thread Andrew Melo
On Mon, Apr 22, 2019 at 10:54 PM yutaochina wrote: > > >

Re: Update / Delete records in Parquet

2019-04-22 Thread Chetan Khatri
Hello Jason, Thank you for the reply. My use case is this: the first time, I do a full load with transformation/aggregation/joins and write to Parquet (as staging), but from then on my source is MS SQL Server; I want to pull only those records that got changed / updated and would like to update them in Parquet as well
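The incremental "pull only changed records and update the staging data" idea can be sketched in plain Python. This is an illustrative sketch only, not the Spark API: in Spark you would read the staging Parquet, merge the changed rows on the key, and rewrite the output; the `upsert` helper and field names here are hypothetical.

```python
# Sketch of a merge/upsert by key (illustrative, not Spark code).
# Existing rows keep their values unless a changed row with the same
# key arrives; new keys are inserted.

def upsert(existing, changes, key="id"):
    """Merge changed/updated rows into the existing dataset by key."""
    merged = {row[key]: row for row in existing}   # current staging data
    for row in changes:                            # incremental pull from the source
        merged[row[key]] = row                     # new key -> insert, existing -> update
    return list(merged.values())

base = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
delta = [{"id": 2, "amt": 25}, {"id": 3, "amt": 30}]
result = upsert(base, delta)
```

Because Parquet files are immutable, the merged result has to be written out as a new snapshot (overwrite) rather than edited in place, which is exactly the pain point discussed in this thread.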

can't download 2.4.1 sourcecode

2019-04-22 Thread yutaochina
When I tried to download the source code, I found it

Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-22 Thread Qian He
Hi all, I'm using Spark's provided LogisticRegression to fit a dataset. Each row of the data has 1.7 million columns, but it is sparse, with only hundreds of 1s. The Spark UI reported high GC time while the model was being trained, and my Spark application got stuck without any response. I have
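A row with 1.7 million columns but only hundreds of nonzeros should be stored sparsely, or the dense representation alone will dominate memory and GC. A minimal plain-Python sketch of the idea (illustrative only; Spark's `ml.linalg.SparseVector` stores indices and values in the same spirit):

```python
# Sketch: store only (column_index, value) pairs for the nonzero
# entries; a dot product then touches only the nonzeros instead of
# iterating over all 1.7 million columns.

def sparse_dot(sparse_row, weights):
    """sparse_row and weights are {column_index: value} dicts."""
    return sum(v * weights.get(i, 0.0) for i, v in sparse_row.items())

row = {3: 1.0, 1_000_000: 1.0}           # two nonzeros out of ~1.7M columns
weights = {3: 0.5, 1_000_000: -0.25}     # hypothetical model weights
value = sparse_dot(row, weights)          # touches 2 entries, not 1.7M
```

If the training data is already sparse, it is worth verifying that the DataFrame's feature column actually holds sparse vectors end to end; a dense intermediate anywhere in the pipeline can reintroduce the memory blow-up.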

Re: Connecting to Spark cluster remotely

2019-04-22 Thread Andrew Melo
Hi Rishikesh, On Mon, Apr 22, 2019 at 4:26 PM Rishikesh Gawade wrote: > > To put it simply, what are the configurations that need to be done on the > client machine so that it can run the driver on itself and executors on > spark-yarn cluster nodes? TBH, if it were me, I would simply SSH to the

Re: Update / Delete records in Parquet

2019-04-22 Thread Jason Nerothin
Hi Chetan, Do you have to use Parquet? It just feels like it might be the wrong sink for a high-frequency change scenario. What are you trying to accomplish? Thanks, Jason On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri wrote: > Hello All, > > If I am doing an incremental load / delta and would

Re: Connecting to Spark cluster remotely

2019-04-22 Thread Rishikesh Gawade
To put it simply, what are the configurations that need to be done on the client machine so that it can run driver on itself and executors on spark-yarn cluster nodes? On Mon, Apr 22, 2019, 8:22 PM Rishikesh Gawade wrote: > Hi. > I have been experiencing trouble while trying to connect to a

Update / Delete records in Parquet

2019-04-22 Thread Chetan Khatri
Hello All, If I am doing an incremental load / delta and would like to update / delete the records in Parquet, I understand that Parquet is immutable and theoretically can't be deleted / updated; only append / overwrite can be done. But I can see utility tools which claim to add value for that.

Re: How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-22 Thread Tathagata Das
SQL windows with the 'over' syntax do not work in Structured Streaming. It is very hard to incrementalize that in the general case. Hence non-time windows are not supported. On Sat, Apr 20, 2019, 2:16 PM Stephen Boesch wrote: > Consider the following *intended* sql: > > select row_number() >
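The distinction TD draws is that *time-based* windows can be maintained incrementally: each arriving event falls into a fixed time bucket, so the engine only updates that bucket's aggregate, whereas `row_number() OVER (...)` may reorder all prior rows. A plain-Python sketch of a tumbling-window count (illustrative of what `groupBy(window(...))` computes; not the Spark API itself):

```python
from collections import defaultdict

def tumbling_window_counts(events, width_secs=600):
    """Bucket (timestamp_secs, key) events into fixed tumbling windows
    and count per (window_start, key). Each new event touches exactly
    one bucket, which is why this aggregation is incrementalizable."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // width_secs) * width_secs
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "a"), (30, "a"), (650, "a"), (700, "b")]
result = tumbling_window_counts(events)
```

A row-number window, by contrast, would need to re-rank previously emitted rows whenever a new row arrives in the middle of the partition's ordering, which is why Structured Streaming rejects it.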

Re: Use derived column for other derived column in the same statement

2019-04-22 Thread Vipul Rajan
Hi Rishi, TL;DR Using Scala, this would work: df.withColumn("derived1", lit("something")).withColumn("derived2", col("derived1") === "something") — just note that I used three equals signs instead of two. That should be enough; if you want to understand why, read further. "==" gives a boolean
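The reason `===` exists is that in Scala `==` is ordinary object equality and returns a plain Boolean immediately, while `===` builds a Column *expression* that Spark evaluates per row later. A toy plain-Python sketch of that expression-builder pattern (illustrative, not the Spark API; in PySpark the `==` operator itself is overloaded this way):

```python
class Expr:
    """A deferred comparison, evaluated per row later."""
    def __init__(self, name, value):
        self.name, self.value = name, value
    def evaluate(self, row):
        return row[self.name] == self.value

class Col:
    """Toy column reference whose '==' builds an Expr instead of
    returning True/False right away."""
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return Expr(self.name, other)

# Builds an expression object, not a boolean:
expr = Col("derived1") == "something"
```

Scala can't overload `==` on Column the same way, so Spark introduces `===` (and `=!=`) for the deferred, per-row comparison.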

Re: Structured Streaming initialized with cached data or others

2019-04-22 Thread Vipul Rajan
Please look into arbitrary stateful aggregation. I do not completely understand your problem, though. If you could give me an example, I'd be happy to help. On Mon, 22 Apr 2019, 15:31 shicheng31...@gmail.com wrote: > Hi, all: > As we all know, structured streaming is used to handle

Connecting to Spark cluster remotely

2019-04-22 Thread Rishikesh Gawade
Hi. I have been experiencing trouble while trying to connect to a Spark cluster remotely. This Spark cluster is configured to run using YARN. Can anyone guide me or provide any step-by-step instructions for connecting remotely via spark-shell? Here's the setup that I am using: The Spark cluster is

Structured Streaming initialized with cached data or others

2019-04-22 Thread shicheng31...@gmail.com
Hi, all: As we all know, structured streaming is used to handle incremental problems. However, if I need to make an increment based on an initial value, I need to get a previous state value when the program is initialized. Is there any way to assign an initial value to the 'state'
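The "seed the state before the stream starts" idea behind the arbitrary-stateful-aggregation suggestion in this thread can be sketched in plain Python. This is an illustrative sketch only, not the `mapGroupsWithState` API: the batch layout and the cached seed value are hypothetical.

```python
def run_stateful(batches, initial_state):
    """Fold incremental micro-batches of (key, delta) pairs into
    per-key state, starting from an initial value loaded before the
    stream begins (e.g. from a previous run's snapshot or a cache)."""
    state = dict(initial_state)            # seed the state, not empty
    for batch in batches:                  # each micro-batch
        for key, delta in batch:
            state[key] = state.get(key, 0) + delta
    return state

seed = {"a": 100}                          # value carried over from a prior run
batches = [[("a", 1), ("b", 2)], [("a", 3)]]
final = run_stateful(batches, seed)
```

In Spark the analogous trick is to load the initial values inside the stateful function on first invocation per key (when the state is not yet set), since the streaming state store itself starts empty.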