Hi Jinhong,
Based on the error message, your second collection of vectors has a
dimension of 804202, while the dimension of your training vectors
was 144109. Please make sure your test dataset is of the same dimension
as the training data.
From the test dataset you posted, the vector
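The check itself is simple; here is a minimal plain-Python sketch of the invariant (toy vectors and a toy dimension, not the 144109/804202 case above, and not Spark vector types):

```python
# Every test vector must have the same dimension as the training vectors;
# otherwise prediction fails with a dimension-mismatch error like the one above.
def check_dims(expected, vectors):
    bad = [i for i, v in enumerate(vectors) if len(v) != expected]
    if bad:
        raise ValueError(
            f"vectors at positions {bad} do not match training dimension {expected}")

check_dims(3, [[1, 2, 3], [4, 5, 6]])  # passes silently
```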
How to scale or possibly auto-scale a spark streaming application consuming
from kafka and using kafka direct streams. We are using spark 1.6.3, cannot
move to 2.x unless there is a strong reason.
Scenario:
Kafka topic with 10 partitions
Standalone cluster running on Kubernetes with 1 master and
This setting allows multiple spark jobs generated through multiple
foreachRDD to run concurrently, even if they are across batches. So output
op2 from batch X, can run concurrently with op1 of batch X+1
This is not safe because it breaks the checkpointing logic in subtle ways.
Note that this was
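For reference, the setting under discussion appears to be the undocumented spark.streaming.concurrentJobs; a sketch of how it is passed (hypothetical app name), with TD's warning above very much applying:

```shell
# Undocumented knob, default 1: lets output ops from different batches run
# concurrently. As noted above, this can break checkpointing in subtle ways.
spark-submit \
  --conf spark.streaming.concurrentJobs=2 \
  streaming_app.py
```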
Hi,
try using this parameter --conf spark.sql.shuffle.partitions=1000
Thanks,
Mohini
On Tue, Mar 14, 2017 at 3:30 PM, kpeng1 wrote:
> Hi All,
>
> I am currently on Spark 1.6 and I was doing a sql join on two tables that
> are over 100 million rows each and I noticed that it
Hi All,
I am currently on Spark 1.6 and I was doing a sql join on two tables that
are over 100 million rows each, and I noticed that it was spawning 3+ tasks
(this is the progress meter that we are seeing show up). We tried
coalesce, repartition and shuffle partitions to drop the number of
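Mohini's suggestion above in context (a sketch with a hypothetical job name; spark.sql.shuffle.partitions controls how many reduce-side tasks a join or aggregation produces, default 200 in Spark 1.6):

```shell
# Raise (or lower) the number of post-shuffle partitions used by SQL joins.
spark-submit \
  --conf spark.sql.shuffle.partitions=1000 \
  join_job.py
```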
Thanks TD for the response. Can you please provide more explanation. I am
having multiple streams in the spark streaming application (Spark 2.0.2
using DStreams). I know many people using this setting. So your
explanation will help a lot of people.
Thanks
On Fri, Mar 10, 2017 at 6:24 PM,
To work around an out of space issue in a Direct Kafka Streaming
application we create topics with a low retention policy (retention.ms=30)
which works fine from the Kafka perspective. However, this results
in an OffsetOutOfRangeException in the Spark job (red line below). Is there any
configuration in
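For context, the topic setup described above would look roughly like this (hypothetical topic name and ZooKeeper address, Kafka 0.x-era CLI):

```shell
# retention.ms=30 makes log segments eligible for deletion after 30 ms, which
# is what triggers OffsetOutOfRangeException when Spark later requests
# offsets that have already been deleted.
kafka-topics.sh --create --zookeeper localhost:2181 \
  --topic events --partitions 10 --replication-factor 1 \
  --config retention.ms=30
```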
I am hoping to open a discussion around the cross validation in mllib. I
found that I often wanted to evaluate multiple estimators/pipelines (with
different algorithms) or the same estimator with different parameter grids.
The CrossValidator and TrainValidationSplit only allow a single estimator
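A common workaround until such an API exists is to drive the loop yourself and reuse the same folds for every candidate. A minimal plain-Python sketch of the idea (toy callables stand in for Spark estimators and evaluators):

```python
# Evaluate several "estimators" on identical k-fold splits and pick the best.
def cross_val_score(fit, score, data, k=3):
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = fit(train)
        scores.append(score(model, test))
    return sum(scores) / k

# Two hypothetical estimators: predict the train mean vs. a constant zero.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean_model = lambda train: sum(train) / len(train)
zero_model = lambda train: 0.0
neg_mse = lambda m, test: -sum((x - m) ** 2 for x in test) / len(test)

results = {name: cross_val_score(fit, neg_mse, data)
           for name, fit in [("mean", mean_model), ("zero", zero_model)]}
best = max(results, key=results.get)
```

The same pattern maps onto Spark by looping over (estimator, paramGrid) pairs and running one CrossValidator per pair with a fixed seed, so the fold assignments match across candidates.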
Thanks Kwon. Goal is to preserve whitespace. Not to alter data in general
or do it with user provided options. It's causing our downstream jobs to
fail.
On Mon, Mar 13, 2017 at 7:23 PM, Hyukjin Kwon wrote:
> Hi, all the options are documented in https://spark.apache.org/
>
I'm sorry, I missed some important information. I use Spark version 2.0.2
in Scala 2.11.8.
2017-03-14 13:44 GMT+01:00 Julian Keppel :
> Hi everybody,
>
> I make some experiments with the Spark kmeans implementation of the new
> DataFrame-API. I compare clustering
Hi everybody,
I make some experiments with the Spark kmeans implementation of the new
DataFrame-API. I compare clustering results of different runs with
different parameters. I noticed that for the random initialization mode, the
seed value is the same every time. How is it calculated? In my
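On the seed question: my reading of the Spark 2.x source is that the default seed is derived from the estimator class name's hashCode, which is why it is identical on every run; treat that as an assumption and verify against your version. To get different initializations per run, pass an explicit seed. A sketch (the pyspark call is shown only as a hypothetical comment):

```python
import random

# Fresh seed per run; without this, the default seed is deterministic.
seed = random.SystemRandom().randrange(2 ** 63)

# Hypothetical pyspark usage (not executed here):
# from pyspark.ml.clustering import KMeans
# km = KMeans(k=5, initMode="random", seed=seed)
```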
Thank you both
Steve that's a very interesting point. I have to admit I have never thought
of doing analysis over time on the tests but it makes sense as the failures
over time tell you quite a bit about your data platform
Thanks for highlighting! We are using Pyspark for now so I hope some
I agree the reporting is an important aspect. Sonarqube (or a similar tool) can
report over time, but does not support Scala directly (only indirectly via JaCoCo). In
the end, you will need to think about a dashboard that displays results over
time.
> On 14 Mar 2017, at 12:44, Steve Loughran
On 13 Mar 2017, at 13:24, Sam Elamin
> wrote:
Hi Jorn
Thanks for the prompt reply. Really, we have 2 main concerns with CD: ensuring
tests pass and linting the code.
I'd add "providing diagnostics when tests fail", which is a
On 14 Mar 2017 4:19 p.m., Gaurav Pandya wrote:
Thanks a lot Michal & Ofir for your insights.
To Ofir - I have not yet finalized my Spark streaming code; it is still a
work in progress. Now that Structured Streaming is available, I thought to
rewrite it to gain maximum benefit in future. As of now, there are no
specific functional or
Hi Yuhao,
I have tried numPartitions values of (numExecutors * numExecutorCores),
1000, 2000 and 1, and did not see much improvement.
Having more partitions solved some perf issues, but I did not see any
improvement when I use a lower minSupport.
It is generating 260 million frequent item sets with
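For scale, a rough back-of-envelope sketch (made-up transaction count): FP-Growth keeps every itemset whose absolute count is at least minSupport * numTransactions, so lowering minSupport grows the output combinatorially, and no numPartitions value can compensate for that:

```python
# Absolute count threshold implied by a fractional minSupport.
num_transactions = 1_000_000_000  # hypothetical dataset size
for min_support in (0.1, 0.01, 0.001):
    threshold = int(min_support * num_transactions)
    print(f"minSupport={min_support} -> keep itemsets seen >= {threshold} times")
```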
To add to what Michael said, my experience was that Structured Streaming in
2.0 was half-baked / alpha, but in 2.1 it is significantly more robust.
Also, a lot of its "missing functionality" was not available in Spark
Streaming either.
HOWEVER, you mentioned that you think about rewriting your
Hi Raju,
Have you tried setNumPartitions with a larger number?
2017-03-07 0:30 GMT-08:00 Eli Super :
> Hi
>
> It's area of knowledge , you will need to read online several hours about
> it
>
> What is your programming language ?
>
> Try search online : "machine learning