Fwd: autoBroadcastJoinThreshold not working as expected

2019-04-23 Thread Mike Chan
Dear all, I'm working on a case where, when a certain table is exposed to a broadcast join, the query eventually fails with a remote block error. First, we set spark.sql.autoBroadcastJoinThreshold to 10 MB, namely 10485760. Then we proceed to perform the query. In the SQL plan, we
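For context, the documented semantics of this setting are that Spark broadcasts a relation to all worker nodes when its estimated size is at or below spark.sql.autoBroadcastJoinThreshold, and that setting it to -1 disables broadcasting entirely. A plain-Python sketch of that decision rule (a hypothetical helper, not Spark's actual planner code):

```python
# Sketch (not Spark source) of the broadcast-join decision as documented:
# broadcast when the estimated relation size is at or below the threshold;
# a threshold of -1 disables automatic broadcasting.

AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB = 10485760 bytes

def should_broadcast(estimated_size_bytes: int,
                     threshold: int = AUTO_BROADCAST_THRESHOLD) -> bool:
    """Return True if a table of this estimated size would be broadcast."""
    return threshold >= 0 and estimated_size_bytes <= threshold

print(should_broadcast(5 * 1024 * 1024))     # small table: broadcast
print(should_broadcast(50 * 1024 * 1024))    # large table: no broadcast
print(should_broadcast(1024, threshold=-1))  # -1 disables broadcasting
```

In real PySpark code the setting is applied with spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760"); note the estimate Spark compares against comes from table statistics, so a stale or missing estimate can make the decision look wrong.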

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Weichen Xu
Could you provide your code, and running cluster info?

On Tue, Apr 23, 2019 at 4:10 PM Qian He wrote:
> The dataset was using a sparse representation before feeding into
> LogisticRegression.
>
> On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu wrote:
>> Hi Qian,
>>
>> Does your dataset use

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Qian He
The dataset was using a sparse representation before feeding into LogisticRegression.

On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu wrote:
> Hi Qian,
>
> Does your dataset use sparse vector format?
>
> On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote:
>> Hi all,
>>
>> I'm using Spark provided

spark 2.4.1 -> 3.0.0-SNAPSHOT mllib

2019-04-23 Thread Koert Kuipers
We recently started compiling against Spark 3.0.0-SNAPSHOT (built in-house from the master branch) to uncover any breaking changes that might be an issue for us. We ran into some of our tests breaking where we use mllib. Most of it is immaterial: we had some magic numbers hard-coded and the results
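One common way to make such tests survive version upgrades is to compare model outputs against expectations with a numeric tolerance rather than exact hard-coded magic numbers, since small optimizer or numerical changes between Spark releases legitimately shift low-order digits. A minimal sketch (the tolerance values here are illustrative assumptions):

```python
import math

# Sketch: compare expected vs. actual coefficients with a relative
# tolerance, so tiny numerical drift between library versions does not
# break the test, while genuinely different results still fail it.

def coefficients_close(expected, actual, rel_tol=1e-6):
    """True when both vectors match element-wise within rel_tol."""
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=1e-12)
        for e, a in zip(expected, actual)
    )

print(coefficients_close([0.25, -1.5], [0.25000001, -1.50000001]))  # drift: ok
print(coefficients_close([0.25, -1.5], [0.30, -1.5]))               # real change
```

The appropriate tolerance depends on the model and data; results that change by more than a few ulps across versions usually indicate a deliberate algorithm change worth reading the release notes for.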

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Weichen Xu
Hi Qian,

Does your dataset use sparse vector format?

On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote:
> Hi all,
>
> I'm using Spark provided LogisticRegression to fit a dataset. Each row of
> the data has 1.7 million columns, but it is sparse with only hundreds of
> 1s. The Spark UI reported
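The sparse-format question matters because, for rows with 1.7 million columns and only hundreds of nonzeros, a sparse representation lets each training pass touch only the nonzero entries. A plain-Python sketch of that idea (a hypothetical (index -> value) layout, not PySpark's actual Vectors class, which would be built with pyspark.ml.linalg.Vectors.sparse(size, indices, values)):

```python
# Hypothetical sketch of a sparse row stored as {index: value} and the
# dot product at the heart of logistic-regression training: the cost is
# proportional to the number of nonzeros, not to the row width.

def sparse_dot(nonzeros: dict, dense_weights: list) -> float:
    """Dot product touching only the nonzero entries of the sparse side."""
    return sum(v * dense_weights[i] for i, v in nonzeros.items())

size = 1_700_000
row = {3: 1.0, 42: 1.0, 1_699_999: 1.0}  # 3 nonzeros out of 1.7 million
weights = [0.5] * size

print(sparse_dot(row, weights))  # touches 3 entries, not 1,700,000
```

If the data were fed in as dense vectors instead, every row would carry 1.7 million doubles, which alone can explain executors getting stuck or spilling.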

Re: toDebugString - RDD Logical Plan

2019-04-23 Thread kanchan tewary
Hello Dylan, thank you for the help. The results do look formatted after making the change. However, from the following code, I was expecting RDD types like MappedRDD and FilteredRDD to be present in the lineage; instead I can only see PythonRDD and ParallelCollectionRDD in the lineage [I am running
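A likely explanation for seeing only PythonRDD in the lineage: PySpark pipelines chained narrow transformations such as map and filter into a single PythonRDD that runs in one pass through the Python worker, so the intermediate per-transformation RDD types never appear in toDebugString. A pure-Python sketch of that fusion (hypothetical names, not PySpark internals):

```python
# Sketch (not PySpark internals) of why chained map/filter steps collapse
# into one lineage node: each stage wraps the previous iterator, so the
# whole chain executes lazily as a single fused traversal.

def fused_pipeline(data, *stages):
    """Apply ('map', f) / ('filter', p) stages lazily in one pass."""
    it = iter(data)
    for kind, fn in stages:
        it = map(fn, it) if kind == "map" else filter(fn, it)
    return list(it)  # one traversal, analogous to one PythonRDD

result = fused_pipeline(
    range(10),
    ("map", lambda x: x * x),
    ("filter", lambda x: x % 2 == 0),
)
print(result)  # [0, 4, 16, 36, 64]
```

Lineage boundaries in toDebugString therefore track shuffle/source boundaries (e.g. ParallelCollectionRDD for parallelize), not each individual Python-side transformation.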

Re: Update / Delete records in Parquet

2019-04-23 Thread Khare, Ankit
Hi Chetan, I also agree that Parquet would not be the best option for this use case. I had a similar use case: 50 different tables to be downloaded from MSSQL. Source: MSSQL. Destination: Apache Kudu (since it supports change-data-capture use cases very well). We used the StreamSets CDC module to
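The reason plain Parquet struggles here is that Parquet files are immutable: an "update" or "delete" logically means reading the affected data, merging the changes by key, and rewriting the files. A plain-Python sketch of that merge-by-key step (the record shapes are hypothetical, for illustration only):

```python
# Hypothetical sketch of the upsert/merge-by-key that a Parquet "update"
# requires: rows with matching keys are replaced by the changed rows,
# deleted keys are dropped, and the merged result is rewritten wholesale.

def merge_by_key(existing, changes, deletes=()):
    """existing/changes: iterables of (key, value) pairs; returns a dict."""
    merged = dict(existing)
    merged.update(dict(changes))   # upsert: insert new keys, replace old
    for key in deletes:
        merged.pop(key, None)      # delete: drop the key if present
    return merged

current = [(1, "alice"), (2, "bob"), (3, "carol")]
updates = [(2, "bobby"), (4, "dave")]
print(merge_by_key(current, updates, deletes=[3]))
# {1: 'alice', 2: 'bobby', 4: 'dave'}
```

Storage engines built for mutation, such as Kudu, perform this reconciliation internally per key, which is why they fit CDC workloads better than rewrite-the-file approaches over raw Parquet.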