MLlib: Does Spark support Canopy Clustering?

2019-04-02 Thread Alok Bhandari
Hello all, I am interested in using the bisecting k-means algorithm implemented in Spark. While using it, I found that some of my clustering requests on large data sets failed with OOM issues. As the data set is expected to be large, I wanted to use some pre-processing steps to reduc

Load Time from HDFS

2019-04-02 Thread Jack Kolokasis
Hello, I want to ask if there is any way to measure the HDFS data loading time at the start of my program. I tried adding an action, e.g. count(), after the val data = sc.textFile() call, but I noticed that my program takes more time to finish than before adding the count call. Is there any other way to do i
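
The approach described in this message (forcing the read with an action and timing it) can be sketched in Python as follows. This is only an illustration: `timed` and `fake_load` are hypothetical names, and `fake_load` is a stand-in for a lazy dataset so the block runs without a cluster. Note that the measured time covers the read plus the count itself, not the read alone.

```python
import time


def timed(label, action):
    """Run action() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


def fake_load():
    # Stand-in for a lazy dataset: no work happens until this is called.
    return sum(1 for _ in range(100_000))


rows, seconds = timed("load + count", fake_load)

# With PySpark the call would look roughly like this (not run here):
#   data = sc.textFile(path).cache()   # assumption: cache so later stages reuse the read
#   n, seconds = timed("load + count", data.count)
```

Without caching, the extra count() makes Spark read the input once for the count and again for the rest of the job, which is one likely reason the program ran longer after the action was added.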

[Spark ML] [Pyspark] [Scenario Beginner] [Level Beginner]

2019-04-02 Thread Steve Pruitt
I am still struggling with getting fit() to work on my dataset. The Spark ML exception that is the issue is: LAPACK.dppsv returned 6 because A is not positive definite. Is A derived from a singular matrix (e.g. collinear column values)? Comparing my standardized Weight values with the tutorial's
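
A small pure-Python sketch of why collinear columns can produce this kind of error. It assumes that A is the Gram matrix X^T X built from the feature columns, which is an assumption about the solver and not something stated in the message. The function names are hypothetical, and the factorization is a plain LDL^T written for illustration, not the LAPACK routine.

```python
def gram(rows):
    """Return X^T X for a list of equal-length feature rows."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


def first_bad_minor(a, rel_tol=1e-12):
    """Attempt an LDL^T factorization of the symmetric matrix a.

    Returns 0 if every pivot is clearly positive, otherwise the 1-based
    index of the first pivot that is not (so a is not positive definite).
    """
    n = len(a)
    scale = max(abs(a[i][i]) for i in range(n)) or 1.0
    l = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = a[j][j] - sum(l[j][k] ** 2 * d[k] for k in range(j))
        if d[j] <= rel_tol * scale:
            return j + 1
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] * d[k] for k in range(j))
            l[i][j] = (a[i][j] - s) / d[j]
    return 0


# Second column is exactly twice the first, so the columns are collinear.
collinear = [[1, 2], [2, 4], [3, 6], [4, 8]]
independent = [[1, 2], [2, 1], [3, 5], [4, 0]]

print(first_bad_minor(gram(collinear)))    # 2
print(first_bad_minor(gram(independent)))  # 0
```

In this toy case the failure is reported at the second pivot because the second column adds no new information beyond the first.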

Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
Hi, I've got 3 questions/issues regarding checkpointing, was hoping someone could help shed some light on this. We've got a Spark Streaming consumer consuming data from a Kafka topic; works fine generally until I switch it to the checkpointing mode by calling the 'checkpoint' method on the contex

Re: Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
To add more info, this project is on an older version of Spark, 1.5.0, and on an older version of Kafka which is 0.8.2.1 (2.10-0.8.2.1). On Tue, Apr 2, 2019 at 11:39 AM Dmitry Goldenberg wrote: > Hi, > > I've got 3 questions/issues regarding checkpointing, was hoping someone > could help shed so

Re: How to extract data in parallel from RDBMS tables

2019-04-02 Thread Surendra , Manchikanti
Looking for a generic solution, not for a specific DB or number of tables. On Fri, Mar 29, 2019 at 5:04 AM Jason Nerothin wrote: > How many tables? What DB? > > On Fri, Mar 29, 2019 at 00:50 Surendra , Manchikanti < > surendra.manchika...@gmail.com> wrote: > >> Hi Jason, >> >> Thanks for your r

Re: How to extract data in parallel from RDBMS tables

2019-04-02 Thread Jason Nerothin
I can *imagine* writing some sort of DataframeReader-generation tool, but am not aware of one that currently exists. On Tue, Apr 2, 2019 at 13:08 Surendra , Manchikanti < surendra.manchika...@gmail.com> wrote: > > Looking for a generic solution, not for a specific DB or number of tables. > > > On
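
One way to picture the reader-generation idea mentioned in this message is a helper that turns a table description into the option set for a partitioned JDBC read. The sketch below is hypothetical: `jdbc_read_options`, `plan_reads`, the table descriptions, and the URL are invented for illustration. The option keys (`url`, `dbtable`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) are the standard Spark JDBC data source options.

```python
def jdbc_read_options(url, table, partition_column, lower, upper, num_partitions):
    """Build the option dict for one partitioned JDBC read."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def plan_reads(url, tables):
    """Return one option dict per table description."""
    return [
        jdbc_read_options(
            url, t["name"], t["key"], t["min"], t["max"], t.get("partitions", 8)
        )
        for t in tables
    ]


# Hypothetical table metadata; in practice it might come from the DB catalog.
tables = [
    {"name": "orders", "key": "id", "min": 1, "max": 1_000_000, "partitions": 16},
    {"name": "customers", "key": "id", "min": 1, "max": 50_000},
]
plans = plan_reads("jdbc:postgresql://db.example/app", tables)

# With PySpark each plan could then be used roughly as (not run here):
#   df = spark.read.format("jdbc").options(**plan).load()
```

Each resulting read is split by Spark into `numPartitions` range queries on the partition column, so the per-table bounds matter more than the generator itself.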

Logging DataFrame API pipelines

2019-04-02 Thread Magnus Nilsson
Hello all, How do you log what is happening inside your Spark DataFrame pipelines? I would like to collect statistics along the way, mostly counts of rows at particular steps, to see where rows were filtered and whatnot. Is there any other way to do this than calling .count on the DataFrame? R