I found RDD.reduceByKey() is really useful and much more efficient than
groupByKey(). I'm wondering whether Datasets/DataFrames have similar APIs?
Sometimes it's convenient to start a spark-shell on a cluster, like:

./spark/bin/spark-shell --master yarn --deploy-mode client --num-executors
100 --executor-memory 15g --executor-cores 4 --driver-memory 10g --queue
myqueue

However, with a command like this, the allocated resources stay occupied
for as long as the shell is running, even while it sits idle.
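One common way to avoid pinning 100 executors for an interactive session is dynamic allocation, which lets YARN grow and shrink the executor pool with the workload. A sketch of the flags, reusing the queue name from the command above; the exact numbers are illustrative, and Spark versions before 3.x also need the external shuffle service enabled on the nodes:

```shell
# Sketch: let YARN scale executors up/down instead of pinning 100 of them.
./spark/bin/spark-shell --master yarn --deploy-mode client \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.maxExecutors=100 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --executor-memory 15g --executor-cores 4 --driver-memory 10g --queue myqueue
```

With these settings, idle executors are released back to the queue after the idle timeout.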
> On Tue, May 14, 2019 at 5:08 PM Qian He wrote:
For example, I have a dataframe with 3 columns: URL, START, END. For each
url in the URL column, I want to fetch the substring of it starting at
position START and ending at END.
+---+-----+---+
|URL|START|END|
+---+-----+---+
I have a 1TB dataset with 100 columns. The first column is a user_id; there
are about 1,000 unique user_ids in this 1TB dataset.
The use case: I want to train an ML model for each user_id on that user's
records (approximately 1GB of records per user). Say the ML model is a
Decision Tree. But it is not
The dataset was using a sparse representation before being fed into
LogisticRegression.
On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu wrote:
> Hi Qian,
>
> Does your dataset use the sparse vector format?
>
>
>
> On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote:
>
Hi all,
I'm using Spark's LogisticRegression to fit a dataset. Each row of the
data has 1.7 million columns, but it is sparse with only hundreds of
1s. The Spark UI reported high GC time while the model was being trained,
and my Spark application got stuck without any response. I have