Dear all,
I'm working on a case where, when a certain table is exposed to a broadcast
join, the query eventually fails with a remote block error.
First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely
10485760:
[image: image.png]
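As a minimal sketch of the setting described above (app name and local session are assumptions, not from the original thread), the 10MB threshold can be applied when building the session or changed at runtime:

```python
from pyspark.sql import SparkSession

# Sketch: cap automatic broadcast joins at 10 MB (10485760 bytes).
spark = (
    SparkSession.builder
    .appName("broadcast-join-demo")  # hypothetical app name
    .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
    .getOrCreate()
)

# The same property can also be changed on a running session:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
```

Setting the value to -1 disables automatic broadcasting entirely, which is a common way to check whether the broadcast join itself is what triggers the failure.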
Then we proceed to run the query. In the SQL plan, we
Could you provide your code and your running cluster info?
On Tue, Apr 23, 2019 at 4:10 PM Qian He wrote:
> The dataset was using a sparse representation before feeding into
> LogisticRegression.
>
> On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu
> wrote:
>
>> Hi Qian,
>>
>> Does your dataset use
The dataset was using a sparse representation before feeding into
LogisticRegression.
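As a sketch of what such a sparse representation looks like (plain Python here, with hypothetical indices; Spark's SparseVector stores the same triple of size, indices, and values), a row with 1.7 million columns and only a few nonzero entries keeps just the nonzero positions:

```python
# A sparse row: only the nonzero (index, value) pairs are stored,
# not a dense array of 1.7 million entries.
size = 1_700_000          # total number of columns
indices = [3, 17, 42]     # positions of the nonzero entries (hypothetical)
values = [1.0, 1.0, 1.0]  # the nonzero values

def dot(indices, values, dense_weights):
    """Dot product of a sparse row with a dense weight vector.

    Touches only the nonzero entries, so the cost is proportional
    to the number of nonzeros, not to the row width.
    """
    return sum(v * dense_weights[i] for i, v in zip(indices, values))

weights = [0.0] * size
weights[17] = 2.0
print(dot(indices, values, weights))  # 3 multiplications, not 1.7 million
```

This is why a dataset with hundreds of 1s per 1.7-million-column row is cheap to train on when kept sparse, and expensive if it is ever densified.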
On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu
wrote:
> Hi Qian,
>
> Does your dataset use a sparse vector format?
>
>
>
> On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote:
>
>> Hi all,
>>
>> I'm using Spark provided
We recently started compiling against Spark 3.0.0-SNAPSHOT (built in-house
from the master branch) to uncover any breaking changes that might be an issue
for us.
We ran into some of our tests breaking where we use MLlib. Most of it is
immaterial: we had some magic numbers hard-coded and the results
Hi Qian,
Does your dataset use a sparse vector format?
On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote:
> Hi all,
>
> I'm using Spark provided LogisticRegression to fit a dataset. Each row of
> the data has 1.7 million columns, but it is sparse with only hundreds of
> 1s. The Spark UI reported
Hello Dylan,
Thank you for the help. The results do look formatted after making the change.
However, from the following code, I was expecting RDD types like MappedRDD
and FilteredRDD to be present in the lineage, but I can only see
PythonRDD and ParallelCollectionRDD in the lineage [I am running
Hi Chetan,
I also agree that for this use case Parquet would not be the best option. I had
a similar use case:
50 different tables to be downloaded from MSSQL.
Source: MSSQL
Destination: Apache Kudu (since it supports change data capture use cases
very well)
We used the StreamSets CDC module to