Fwd: autoBroadcastJoinThreshold not working as expected

2019-04-23 Thread Mike Chan
Dear all, I'm working on a case where, when a certain table is exposed to a broadcast join, the query eventually fails with a remote block error. First, we set spark.sql.autoBroadcastJoinThreshold to 10 MB, namely 10485760. Then we proceed to perform the query. In the SQL plan, we
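For context, the documented semantics of this setting are that Spark broadcasts a relation to all worker nodes when its estimated size is at or below spark.sql.autoBroadcastJoinThreshold, and that setting it to -1 disables broadcasting entirely. A plain-Python sketch of that decision rule (a hypothetical helper, not Spark's actual planner code):

```python
# Sketch (not Spark source) of the broadcast-join decision as documented:
# broadcast when the estimated relation size is at or below the threshold;
# a threshold of -1 disables automatic broadcasting.

AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB = 10485760 bytes

def should_broadcast(estimated_size_bytes: int,
                     threshold: int = AUTO_BROADCAST_THRESHOLD) -> bool:
    """Return True if a table of this estimated size would be broadcast."""
    return threshold >= 0 and estimated_size_bytes <= threshold

print(should_broadcast(5 * 1024 * 1024))     # small table: broadcast
print(should_broadcast(50 * 1024 * 1024))    # large table: no broadcast
print(should_broadcast(1024, threshold=-1))  # -1 disables broadcasting
```

In real PySpark code the setting is applied with spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760"); note the estimate Spark compares against comes from table statistics, so a stale or missing estimate can make the decision look wrong.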

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Weichen Xu
Could you provide your code, and running cluster info?

On Tue, Apr 23, 2019 at 4:10 PM Qian He wrote:
> The dataset was using a sparse representation before feeding into
> LogisticRegression.
>
> On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu wrote:
>> Hi Qian,
>>
>> Does your dataset use

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Qian He
The dataset was using a sparse representation before feeding into LogisticRegression.

On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu wrote:
> Hi Qian,
>
> Does your dataset use sparse vector format?
>
> On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote:
>> Hi all,
>>
>> I'm using Spark provided

spark 2.4.1 -> 3.0.0-SNAPSHOT mllib

2019-04-23 Thread Koert Kuipers
We recently started compiling against Spark 3.0.0-SNAPSHOT (built in-house from the master branch) to uncover any breaking changes that might be an issue for us. We ran into some of our tests breaking where we use mllib. Most of it is immaterial: we had some magic numbers hard-coded and the results
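One common way to make such tests survive version upgrades is to compare model outputs against expectations with a numeric tolerance rather than exact hard-coded magic numbers, since small optimizer or numerical changes between Spark releases legitimately shift low-order digits. A minimal sketch (the tolerance values here are illustrative assumptions):

```python
import math

# Sketch: compare expected vs. actual coefficients with a relative
# tolerance, so tiny numerical drift between library versions does not
# break the test, while genuinely different results still fail it.

def coefficients_close(expected, actual, rel_tol=1e-6):
    """True when both vectors match element-wise within rel_tol."""
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=1e-12)
        for e, a in zip(expected, actual)
    )

print(coefficients_close([0.25, -1.5], [0.25000001, -1.50000001]))  # drift: ok
print(coefficients_close([0.25, -1.5], [0.30, -1.5]))               # real change
```

The appropriate tolerance depends on the model and data; results that change by more than a few ulps across versions usually indicate a deliberate algorithm change worth reading the release notes for.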

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Weichen Xu
Hi Qian,

Does your dataset use sparse vector format?

On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote:
> Hi all,
>
> I'm using Spark provided LogisticRegression to fit a dataset. Each row of
> the data has 1.7 million columns, but it is sparse with only hundreds of
> 1s. The Spark UI reported
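The sparse-format question matters because, for rows with 1.7 million columns and only hundreds of nonzeros, a sparse representation lets each training pass touch only the nonzero entries. A plain-Python sketch of that idea (a hypothetical (index -> value) layout, not PySpark's actual Vectors class, which would be built with pyspark.ml.linalg.Vectors.sparse(size, indices, values)):

```python
# Hypothetical sketch of a sparse row stored as {index: value} and the
# dot product at the heart of logistic-regression training: the cost is
# proportional to the number of nonzeros, not to the row width.

def sparse_dot(nonzeros: dict, dense_weights: list) -> float:
    """Dot product touching only the nonzero entries of the sparse side."""
    return sum(v * dense_weights[i] for i, v in nonzeros.items())

size = 1_700_000
row = {3: 1.0, 42: 1.0, 1_699_999: 1.0}  # 3 nonzeros out of 1.7 million
weights = [0.5] * size

print(sparse_dot(row, weights))  # touches 3 entries, not 1,700,000
```

If the data were fed in as dense vectors instead, every row would carry 1.7 million doubles, which alone can explain executors getting stuck or spilling.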

Re: toDebugString - RDD Logical Plan

2019-04-23 Thread kanchan tewary
Hello Dylan, thank you for the help. The results do look formatted after making the change. However, from the following code, I was expecting RDD types like MappedRDD and FilteredRDD to be present in the lineage; instead I can only see PythonRDD and ParallelCollectionRDD in the lineage [I am running
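A likely explanation for seeing only PythonRDD in the lineage: PySpark pipelines chained narrow transformations such as map and filter into a single PythonRDD that runs in one pass through the Python worker, so the intermediate per-transformation RDD types never appear in toDebugString. A pure-Python sketch of that fusion (hypothetical names, not PySpark internals):

```python
# Sketch (not PySpark internals) of why chained map/filter steps collapse
# into one lineage node: each stage wraps the previous iterator, so the
# whole chain executes lazily as a single fused traversal.

def fused_pipeline(data, *stages):
    """Apply ('map', f) / ('filter', p) stages lazily in one pass."""
    it = iter(data)
    for kind, fn in stages:
        it = map(fn, it) if kind == "map" else filter(fn, it)
    return list(it)  # one traversal, analogous to one PythonRDD

result = fused_pipeline(
    range(10),
    ("map", lambda x: x * x),
    ("filter", lambda x: x % 2 == 0),
)
print(result)  # [0, 4, 16, 36, 64]
```

Lineage boundaries in toDebugString therefore track shuffle/source boundaries (e.g. ParallelCollectionRDD for parallelize), not each individual Python-side transformation.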

Re: Update / Delete records in Parquet

2019-04-23 Thread Khare, Ankit
Hi Chetan, I also agree that Parquet would not be the best option for this use case. I had a similar use case: 50 different tables to be downloaded from MSSQL. Source: MSSQL. Destination: Apache Kudu (since it supports change-data-capture use cases very well). We used the StreamSets CDC module to
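The reason plain Parquet struggles here is that Parquet files are immutable: an "update" or "delete" logically means reading the affected data, merging the changes by key, and rewriting the files. A plain-Python sketch of that merge-by-key step (the record shapes are hypothetical, for illustration only):

```python
# Hypothetical sketch of the upsert/merge-by-key that a Parquet "update"
# requires: rows with matching keys are replaced by the changed rows,
# deleted keys are dropped, and the merged result is rewritten wholesale.

def merge_by_key(existing, changes, deletes=()):
    """existing/changes: iterables of (key, value) pairs; returns a dict."""
    merged = dict(existing)
    merged.update(dict(changes))   # upsert: insert new keys, replace old
    for key in deletes:
        merged.pop(key, None)      # delete: drop the key if present
    return merged

current = [(1, "alice"), (2, "bob"), (3, "carol")]
updates = [(2, "bobby"), (4, "dave")]
print(merge_by_key(current, updates, deletes=[3]))
# {1: 'alice', 2: 'bobby', 4: 'dave'}
```

Storage engines built for mutation, such as Kudu, perform this reconciliation internally per key, which is why they fit CDC workloads better than rewrite-the-file approaches over raw Parquet.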