High Availability for spark streaming application running in kubernetes

2020-06-24 Thread Shenson Joseph
Hello, I have a Spark streaming application running in Kubernetes, and we use the Spark operator to submit Spark jobs. Any suggestions on: 1. How to handle high availability for Spark streaming applications? 2. What would be the best approach to handle high availability of checkpoint data if we don't
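The standard answer for streaming HA is checkpointing to durable storage plus the recover-or-create driver pattern (StreamingContext.getOrCreate in Spark). A minimal pure-Python sketch of that pattern — the file layout and names here are illustrative, not Spark's actual checkpoint format:

```python
import os
import pickle

def get_or_create(checkpoint_dir, create_fn):
    """Recover state from a checkpoint if one exists, else build it fresh.
    Hypothetical analog of StreamingContext.getOrCreate: a restarted driver
    resumes from the last durable checkpoint instead of starting from zero."""
    path = os.path.join(checkpoint_dir, "state.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # resume from the durable checkpoint
    state = create_fn()            # first start: build from scratch
    with open(path, "wb") as f:
        pickle.dump(state, f)      # persist so a restarted driver can recover
    return state
```

The point for HA is that the checkpoint directory must live on storage that survives pod restarts (HDFS, S3, a PersistentVolume), not on a container-local disk.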

Re: share datasets across multiple spark-streaming applications for lookup

2017-10-31 Thread Joseph Pride
Folks: SnappyData. I’m fairly new to working with it myself, but it looks pretty promising. It marries Spark with a co-located in-memory GemFire (or something gem-related) database. So you can access the data with SQL, JDBC, ODBC (if you wanna go Enterprise instead of open-source) or natively

GraphFrames 0.5.0 - critical bug fix + other improvements

2017-05-19 Thread Joseph Bradley
/tag/release-0.5.0 *Docs*: http://graphframes.github.io/ *Spark Package*: https://spark-packages.org/package/graphframes/graphframes *Source*: https://github.com/graphframes/graphframes Thanks to all contributors and to the community for feedback! Joseph -- Joseph Bradley Software Engineer

GraphFrames 0.4.0 release, with Apache Spark 2.1 support

2017-03-28 Thread Joseph Bradley
*Docs*: http://graphframes.github.io/ *Spark Package*: https://spark-packages.org/package/graphframes/graphframes *Source*: https://github.com/graphframes/graphframes Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: LDA in Spark

2017-03-23 Thread Joseph Bradley
s not have this yet. Please feel free to make a feature request JIRA for it! Thanks, Joseph On Thu, Mar 23, 2017 at 4:54 PM, Mathieu DESPRIEE <mdespr...@bluedme.com> wrote: > Hello Joseph, > > I saw your contribution to online LDA in Spark (SPARK-5563). Please allow > me a very q

SQL warehouse dir

2017-02-10 Thread Joseph Naegele
Hi all, I've read the docs for Spark SQL 2.1.0 but I'm still having issues with the warehouse and related details. I'm not using Hive proper, so my hive-site.xml consists only of: javax.jdo.option.ConnectionURL jdbc:derby:;databaseName=/mnt/data/spark/metastore_db;create=true I've set

Spark SQL 1.6.3 ORDER BY and partitions

2017-01-06 Thread Joseph Naegele
I have two separate but similar issues that I've narrowed down to a pretty good level of detail. I'm using Spark 1.6.3, particularly Spark SQL. I'm concerned with a single dataset for now, although the details apply to other, larger datasets. I'll call it "table". It's around 160 M records,

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Joseph Bradley
= sqlContext.read.parquet(out) >>>>> >>>>> // remove previous checkpoint >>>>> if (iteration > checkpointInterval) { >>>>> *FileSystem.get(sc.hadoopConfiguration)* >>>>> *.delete(n
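The quoted loop keeps checkpoint storage bounded by deleting the checkpoint written `checkpointInterval` iterations earlier. A pure-Python sketch of that bookkeeping — local directories stand in for HDFS paths, and the per-iteration directory layout is an assumption for illustration:

```python
import os
import shutil

def rotate_checkpoints(base_dir, iteration, interval):
    """Delete the checkpoint written `interval` iterations ago, keeping only
    recent ones -- the same bookkeeping as the quoted ConnectedComponents loop.
    Directory-per-iteration layout is illustrative, not GraphFrames' actual one."""
    if iteration > interval:
        stale = os.path.join(base_dir, str(iteration - interval))
        if os.path.isdir(stale):
            shutil.rmtree(stale)  # analogous to FileSystem.delete(...) on HDFS
    current = os.path.join(base_dir, str(iteration))
    os.makedirs(current, exist_ok=True)  # where this iteration's Parquet would land
    return current
```

After iterating, only the last `interval` checkpoints remain on disk, so long-running iterative jobs don't fill the filesystem.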

Storage history in web UI

2017-01-03 Thread Joseph Naegele
Hi all, Is there any way to observe Storage history in Spark, i.e. which RDDs were cached and where, etc. after an application completes? It appears the Storage tab in the History Server UI is useless. Thanks --- Joe Naegele Grier Forensics

RE: [Spark SQL] Task failed while writing rows

2016-12-19 Thread Joseph Naegele
. Thanks --- Joe Naegele Grier Forensics From: Michael Stratton [mailto:michael.strat...@komodohealth.com] Sent: Monday, December 19, 2016 10:00 AM To: Joseph Naegele <jnaeg...@grierforensics.com> Cc: user <user@spark.apache.org> Subject: Re: [Spark SQL] Task failed while

[Spark SQL] Task failed while writing rows

2016-12-18 Thread Joseph Naegele
Hi all, I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). Its current compressed size is around 13 GB, but my problem started when it was much smaller, maybe 5 GB. This dataset is generated

spark nightly builds with Hadoop 2.7

2016-09-09 Thread Joseph Naegele
Hello, I'm using the Spark nightly build "spark-2.1.0-SNAPSHOT-bin-hadoop2.7" from http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/ due to bugs in Spark 2.0.0 (SPARK-16740, SPARK-16802), however I noticed that the recent builds only come in "-hadoop2.4-without-hive" and

Re: GraphFrames 0.2.0 released

2016-08-26 Thread Joseph Bradley
This should do it: https://github.com/graphframes/graphframes/releases/tag/release-0.2.0 Thanks for the reminder! Joseph On Wed, Aug 24, 2016 at 10:11 AM, Maciej Bryński <mac...@brynski.pl> wrote: > Hi, > Do you plan to add tag for this release on github ? > https://github.

Re: Spark Thrift Server Concurrency

2016-06-27 Thread Prabhu Joseph
> data for some queries and not others? > > It sounds like an interesting problem… > > On Jun 23, 2016, at 5:21 AM, Prabhu Joseph <prabhujose.ga...@gmail.com> > wrote: > > Hi All, > >On submitting 20 parallel same SQL query to Spark Thrift Server, the > qu

Spark Thrift Server Concurrency

2016-06-23 Thread Prabhu Joseph
Concurrency is affected by the single driver. How do we improve concurrency, and what are the best practices? Thanks, Prabhu Joseph

sparkR.init() can not load sparkPackages.

2016-06-16 Thread Joseph
file:/home/hadoop/spark-1.6.1-bin-hadoop2.6/data/mllib/sample_tree_data.csv", "csv") registerTempTable(people, "people") teenagers <- sql(sqlContext, "SELECT * FROM people") head(teenagers) Joseph

The metastore database gives errors when start spark-sql CLI.

2016-05-13 Thread Joseph
0 STATEMENT: SELECT @@version This does not affect normal use, but maybe it is a bug! ( I use spark 1.6.1 and hive 1.2.1) Joseph

When start spark-sql, postgresql gives errors.

2016-05-13 Thread Joseph
0 STATEMENT: SELECT @@version This does not affect normal use, but maybe it is a bug! ( I use spark 1.6.1 and hive 1.2.1) Joseph

When start spark-sql, postgresql gives errors.

2016-05-11 Thread Joseph
0 STATEMENT: SELECT @@version This does not affect normal use, but maybe it is a bug! ( I use spark 1.6.1 and hive 1.2.1) Joseph

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Joseph Bradley
+1 By the way, the JIRA for tracking (Scala) API parity is: https://issues.apache.org/jira/browse/SPARK-4591 On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia wrote: > This sounds good to me as well. The one thing we should pay attention to > is how we update the docs so

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
Can you try reducing maxBins? That reduces communication (at the cost of coarser discretization of continuous features). On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley <jos...@databricks.com> wrote: > In my experience, 20K is a lot but often doable; 2K is easy; 200 is > small. C

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
In my experience, 20K is a lot but often doable; 2K is easy; 200 is small. Communication scales linearly in the number of features. On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote: > Joseph, > > Correction, there 20k features. Is it still a lot
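A rough cost model makes the scaling concrete: each worker aggregates one stats vector per (feature, bin) per tree node, so communication grows linearly in the number of features and in maxBins. A back-of-envelope sketch — `stats_size=3` is an illustrative assumption (e.g. count/sum/sum-of-squares for regression), not MLlib's exact layout:

```python
def stats_per_tree_node(num_features, max_bins, stats_size=3):
    """Back-of-envelope size of the aggregate each worker ships per tree node:
    one stats vector per (feature, bin) pair. stats_size is an assumption."""
    return num_features * max_bins * stats_size

# Halving maxBins halves the communication; features scale it linearly:
assert stats_per_tree_node(20_000, 32) == 2 * stats_per_tree_node(20_000, 16)
assert stats_per_tree_node(20_000, 32) == 10 * stats_per_tree_node(2_000, 32)
```

This is why the earlier reply suggests reducing maxBins: it cuts shuffle volume directly, at the cost of coarser splits on continuous features.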

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Joseph Bradley
First thought: 70K features is *a lot* for the MLlib implementation (and any PLANET-like implementation) Using fewer partitions is a good idea. Which Spark version was this on? On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov wrote: > The questions I have in mind: >

Re: Handling Missing Values in MLLIB Decision Tree

2016-03-22 Thread Joseph Bradley
It does not currently handle surrogate splits. You will need to preprocess your data to remove or fill in missing values. I'd recommend using the DataFrame API for that since it comes with a number of na methods. Joseph On Thu, Mar 17, 2016 at 9:51 PM, Abir Chakraborty <abi...@247-inc.
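Mean imputation is one common way to fill numeric gaps before tree training; a toy stand-in for the DataFrame `na` preprocessing suggested above (not the Spark API itself):

```python
def fill_missing_with_mean(rows, col):
    """Replace None in `col` with the column mean -- the same idea as the
    DataFrame na.fill(...) preprocessing suggested above. Toy stand-in for
    illustration, not the Spark API."""
    present = [r[col] for r in rows if r[col] is not None]
    mean = sum(present) / len(present)
    return [{**r, col: mean if r[col] is None else r[col]} for r in rows]
```

Dropping rows with na.drop is the other option; which is appropriate depends on how much data is missing and whether missingness is informative.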

Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
thought, we could probably avoid using indices altogether. I just created https://issues.apache.org/jira/browse/SPARK-14043 to track this. On Mon, Mar 21, 2016 at 11:22 AM, Eugene Morozov <evgeny.a.moro...@gmail.com > wrote: > Hi, Joseph, > > I thought I understood, why it has a lim

Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
a design document (Google doc & PDF). Thanks in advance for feedback! Joseph

Spark: The build-in indexes in ORC file do not work.

2016-03-20 Thread Joseph
4 disks per datanode. data size: Total 800 ORC files, each file is about 51MB, total 560,000,000 rows, 57 columns, only one table named gprs (ORC format). Thanks! Joseph

Improving Spark Scheduler Delay

2016-03-19 Thread Prabhu Joseph
, Prabhu Joseph

Re: Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Joseph
select count(*) from gprs where terminal_type=0; (scans all the data) Time taken: 395.968 seconds The following is my environment: 3 nodes, 12 CPU cores per node, 48G free memory per node, 4 disks per node, 3 replications per block, hadoop 2.7.2,

The build-in indexes in ORC file does not work.

2016-03-16 Thread Joseph
especially in Spark SQL)? What's my issue? Thanks! Joseph

Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
ted. Meanwhile, there were 41405 Tasks in the the > 163 Stages that were skipped. > > I think -- but the Spark UI's accounting may not be 100% accurate and bug > free. > > On Tue, Mar 15, 2016 at 6:34 PM, Prabhu Joseph <prabhujose.ga...@gmail.com > > wrote: > >> Okay

Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
d. >> >> >> On Wed, Mar 16, 2016 at 8:14 AM, Prabhu Joseph < >> prabhujose.ga...@gmail.com> wrote: >> >>> Hi All, >>> >>> >>> Spark UI Completed Jobs section shows below information, what is the >>> skipped value shown for

Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
/14 15:35:32 1.4 min 164/164 * (163 skipped) *19841/19788 *(41405 skipped)* Thanks, Prabhu Joseph

Re: Launch Spark shell using differnt python version

2016-03-15 Thread Prabhu Joseph
in pyspark script. DEFAULT_PYTHON="/ANACONDA/anaconda2/bin/python2.7" Thanks, Prabhu Joseph On Tue, Mar 15, 2016 at 11:52 AM, Stuti Awasthi <stutiawas...@hcl.com> wrote: > Hi All, > > > > I have a Centos cluster (without any sudo permissions) which has by &

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Prabhu Joseph
> > > http://talebzadehmich.wordpress.com > > > > On 14 March 2016 at 08:06, Sabarish Sasidharan < > sabarish.sasidha...@manthan.com> wrote: > >> Which version of Spark are you using? The configuration varies by version. >> >> Regards >> Sab

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Prabhu Joseph
ply swap > the fractions in your case. > > Regards > Sab > > On Mon, Mar 14, 2016 at 2:20 PM, Prabhu Joseph <prabhujose.ga...@gmail.com > > wrote: > >> It is a Spark-SQL and the version used is Spark-1.2.1. >> >> On Mon, Mar 14, 2016 at 2:16 PM, Sabarish

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Prabhu Joseph
>> >> >> On 14 March 2016 at 08:06, Sabarish Sasidharan < >> sabarish.sasidha...@manthan.com> wrote: >>> Which version of Spark are you using? The configuration varies by >>> version. >>> >>> Regards >>>

Hive Query on Spark fails with OOM

2016-03-13 Thread Prabhu Joseph
for cache. So, when a Spark Executor has a lot of memory available for cache and does not use it, but then needs to do a lot of shuffle, will executors use only the shuffle fraction configured for shuffling, or will they also use the free memory available for cache? Thanks, Prabhu Joseph

Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192, I suspect some input record has a null key. 189 while (records.hasNext) { 190addElementsRead() 191kv = records.next() 192map.changeValue((getPartition(kv._1), kv._1), update) On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph <prabhujose
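The usual guard is to filter out null keys before the shuffle (`rdd.filter(_._1 != null)` in Scala); a toy Python stand-in of the same idea:

```python
def drop_null_keys(pairs):
    """Filter out records whose key is None before partitioning/sorting --
    the pre-shuffle guard for the null-key NPE discussed above. Toy stand-in
    for the Scala rdd.filter(_._1 != null)."""
    return [(k, v) for k, v in pairs if k is not None]
```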

Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192 189 while (records.hasNext) { addElementsRead() kv = records.next() map.changeValue((getPartition(kv._1), kv._1), update) maybeSpillCollection(usingMap = true) } On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru wrote: > I am seeing

Re: Spark configuration with 5 nodes

2016-03-10 Thread Prabhu Joseph
, Prabhu Joseph On Fri, Mar 11, 2016 at 3:45 AM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote: > > Hi, > > We intend to use 5 servers which will be utilized for building Bigdata > Hadoop data warehouse system (not using any propriety distribution like > Hortonworks or Cl

Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Prabhu Joseph
just want to be able to replicate hot cached blocks right? > > > On Tuesday, March 8, 2016, Prabhu Joseph <prabhujose.ga...@gmail.com> > wrote: > >> Hi All, >> >> When a Spark Job is running, and one of the Spark Executor on Node A >> has some partitio

Spark Scheduler creating Straggler Node

2016-03-08 Thread Prabhu Joseph
shuffle files from an external service instead of from each other, which offloads work from the Spark Executors. We want to check whether a similar external service is implemented for transferring cached partitions to other executors. Thanks, Prabhu Joseph

Spark Partitioner vs Spark Shuffle Manager

2016-03-07 Thread Prabhu Joseph
Hi All, What is the difference between the Spark Partitioner and the Spark Shuffle Manager? The Spark Partitioner is by default a hash partitioner, and the Spark shuffle manager is sort-based; the others are Hash and Tungsten Sort. Thanks, Prabhu Joseph
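To make the split of responsibilities concrete: the partitioner only answers "which partition does this key go to", while the shuffle manager decides how the resulting data movement is executed (sort-based, hash-based, etc.). HashPartitioner's placement rule is essentially a non-negative mod; a sketch where Python's hash() stands in for Java's hashCode:

```python
def hash_partition(key, num_partitions):
    """Spark's HashPartitioner placement rule in miniature: non-negative
    hash(key) mod numPartitions. The double mod mirrors Spark's
    nonNegativeMod helper; Python's hash() stands in for Java hashCode."""
    return (hash(key) % num_partitions + num_partitions) % num_partitions

# Every key maps deterministically to a partition id in [0, num_partitions):
assert all(0 <= hash_partition(k, 4) < 4 for k in ["a", "b", 42, -7])
```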

Spark Custom Partitioner not picked

2016-03-06 Thread Prabhu Joseph
= { val pieces = line.split(' ') val level = pieces(2).toString val one = pieces(0).toString val two = pieces(1).toString (level,LogClass(one,two)) } val output = logData.map(x => parse(x)) *val partitioned = output.partitionBy(new ExactPartitioner(5)).persist()val groups = partitioned.groupByKey(new ExactPartitioner(5))* groups.count() output.partitions.size partitioned.partitions.size } } Thanks, Prabhu Joseph

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Prabhu Joseph
Were all NodeManager services restarted after the change in yarn-site.xml? On Thu, Mar 3, 2016 at 6:00 AM, Jeff Zhang wrote: > The executor may fail to start. You need to check the executor logs, if > there's no executor log then you need to check node manager log. > > On Wed,

Spark job on YARN ApplicationMaster DEBUG log

2016-03-02 Thread Prabhu Joseph
Hi All, I am trying to enable DEBUG logging for the Spark ApplicationMaster, but it is not working. When running the Spark job, I passed -Dlog4j.configuration=file:/opt/mapr/spark/spark-1.4.1/conf/log4j.properties The log4j.properties has log4j.rootCategory=DEBUG, console Spark Executor Containers have DEBUG logs but

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Prabhu Joseph
Matthias, Can you check appending the jars in LAUNCH_CLASSPATH of spark-1.4.1/sbin/spark_class 2016-03-02 21:39 GMT+05:30 Matthias Niehoff : > no, not to driver and executor but to the master and worker instances of > the spark standalone cluster > > Am 2.

Re: Concurreny does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Prabhu Joseph
is java old threading is used somewhere. On Friday, February 19, 2016, Jörn Franke <jornfra...@gmail.com> wrote: > How did you configure YARN queues? What scheduler? Preemption ? > > > On 19 Feb 2016, at 06:51, Prabhu Joseph <prabhujose.ga...@gmail.com > <javascrip

Concurreny does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Prabhu Joseph
are taking 2-3 times longer than A, which shows concurrency does not improve with shared Spark Context. [Spark Job Server] Thanks, Prabhu Joseph

Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Prabhu Joseph
he.spark.sql.hive.HiveContext] > > res0: Boolean = true > > > > On Mon, Feb 15, 2016 at 8:51 PM, Prabhu Joseph <prabhujose.ga...@gmail.com > > wrote: > >> Hi All, >> >> On creating HiveContext in spark-shell, fails with >> >> Caused by:

Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Prabhu Joseph
. But without HiveContext, i am able to query the data using SqlContext . scala> var df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/SPARK/abc") df: org.ap

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Prabhu Joseph
SPARK_MASTER_IP at worker nodes. Check the logs of the other running workers to see which SPARK_MASTER_IP they connected to; I don't think it is using a wrong Master IP. Thanks, Prabhu Joseph On Mon, Feb 15, 2016 at 12:34 PM, Kartik Mathur <kar...@bluedata.com> wrote: > Thanks Prabhu , >

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Prabhu Joseph
in Worker nodes are exactly the same as what Spark Master GUI shows. Thanks, Prabhu Joseph On Mon, Feb 15, 2016 at 11:51 AM, Kartik Mathur <kar...@bluedata.com> wrote: > on spark 1.5.2 > I have a spark standalone cluster with 6 workers , I left the cluster idle > for 3 days and aft

Re: Spark Job on YARN accessing Hbase Table

2016-02-10 Thread Prabhu Joseph
in hadoop-2.5.1 and hence spark.yarn.dist.files does not work with hadoop-2.5.1, spark.yarn.dist.files works fine on hadoop-2.7.0, as CWD/* is included in container classpath through some bug fix. Searching for the JIRA. Thanks, Prabhu Joseph On Wed, Feb 10, 2016 at 4:04 PM, Ted Yu <yuz

Re: Spark Job on YARN accessing Hbase Table

2016-02-10 Thread Prabhu Joseph
of hbase client jars, when i checked launch container.sh , Classpath does not have $PWD/* and hence all the hbase client jars are ignored. Is spark.yarn.dist.files not for adding jars into the executor classpath. Thanks, Prabhu Joseph On Tue, Feb 9, 2016 at 1:42 PM, Prabhu Joseph <prabhujose

Long running Spark job on YARN throws "No AMRMToken"

2016-02-08 Thread Prabhu Joseph
://issues.apache.org/jira/browse/SPARK-5342 spark.yarn.credentials.file How to renew the AMRMToken for a long running job on YARN? Thanks, Prabhu Joseph

Re: Long running Spark job on YARN throws "No AMRMToken"

2016-02-08 Thread Prabhu Joseph
+ Spark-Dev On Tue, Feb 9, 2016 at 10:04 AM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote: > Hi All, > > A long running Spark job on YARN throws below exception after running > for few days. > > yarn.ApplicationMaster: Reporter thread

Re: Spark job does not perform well when some RDD in memory and some on Disk

2016-02-04 Thread Prabhu Joseph
up and launching it on a less-local node. So after making it 0, all tasks started in parallel. But I learned that it is better not to reduce it to 0. On Mon, Feb 1, 2016 at 2:02 PM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote: > Hi All, > > > Sample Spark application which re

Re: Spark job does not perform well when some RDD in memory and some on Disk

2016-02-04 Thread Prabhu Joseph
then programming > must be the process of putting ..." > - Edsger Dijkstra > > "If you pay peanuts you get monkeys" > > > 2016-02-04 11:33 GMT+01:00 Prabhu Joseph <prabhujose.ga...@gmail.com>: > >> Okay, the reason for the task delay within executor

Re: About cache table performance in spark sql

2016-02-03 Thread Prabhu Joseph
does not have enough heap. Thanks, Prabhu Joseph On Thu, Feb 4, 2016 at 11:25 AM, fightf...@163.com <fightf...@163.com> wrote: > Hi, > > I want to make sure that the cache table indeed would accelerate sql > queries. Here is one of my use case : > impala table size : 24.5

Spark saveAsHadoopFile stage fails with ExecutorLostfailure

2016-02-02 Thread Prabhu Joseph
, saveAsHadoopFile runs fine. What could be the reason for ExecutorLostFailure when the number of cores per executor is high? Error: ExecutorLostFailure (executor 3 lost) 16/02/02 04:22:40 WARN TaskSetManager: Lost task 1.3 in stage 15.0 (TID 1318, hdnprd-c01-r01-14): Thanks, Prabhu Joseph

Spark Executor retries infinitely

2016-02-01 Thread Prabhu Joseph
2.0 GB RAM 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now LOADING 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now RUNNING .... Thanks, Prabhu Joseph

Re: Spark Executor retries infinitely

2016-02-01 Thread Prabhu Joseph
Thanks Ted. My concern is how to avoid these kinds of user errors on a production cluster; it would be better if Spark handled this instead of creating an Executor every second that fails, overloading the Spark Master. Shall I file a Spark JIRA to handle this? Thanks, Prabhu Joseph

Re: Spark LDA model reuse with new set of data

2016-01-26 Thread Joseph Bradley
. If you're using ml.clustering.LDAModel (DataFrame API), then you can call transform() on new data. Does that work? Joseph On Tue, Jan 19, 2016 at 6:21 AM, doruchiulan <doru.chiu...@gmail.com> wrote: > Hi, > > Just so you know, I am new to Spark, and also very new to ML (this is my

Spark on YARN job continuously reports "Application does not exist in cache"

2016-01-13 Thread Prabhu Joseph
application attempt, there are many finishApplicationMaster request causing the ERROR. Need your help to understand on what scenario the above happens. JIRA's related are https://issues.apache.org/jira/browse/SPARK-1032 https://issues.apache.org/jira/browse/SPARK-3072 Thanks, Prabhu Joseph

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-11 Thread Prabhu Joseph
machine and jps -l will list all java processes, jstack -l will give the stack trace. Thanks, Prabhu Joseph On Mon, Jan 11, 2016 at 7:56 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > Hi Prabhu thanks for the response. How do I find pid of a slow running > task. Task is running in

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-02 Thread Prabhu Joseph
every 2 seconds for a total of 1 minute. This will help to identify the code where threads are spending a lot of time, and then try to tune it. Thanks, Prabhu Joseph On Sat, Jan 2, 2016 at 1:28 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > Hi thanks I did that and I have attached thread du

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread Prabhu Joseph
Take a thread dump of the Executor process several times in a short time period and check what each thread is doing at different times, which will help to identify the expensive sections in user code. Thanks, Prabhu Joseph On Sat, Jan 2, 2016 at 3:28 AM, unk1102 <umesh.ka...@gmail.com>
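The JVM-side recipe is to run `jstack <pid>` a few times over a short window and compare the samples. A pure-Python analog of the same sampling idea, snapshotting all threads of the current process:

```python
import sys
import time
import traceback

def sample_stacks(n_samples=3, interval=0.01):
    """Collect several point-in-time stack snapshots of all threads in this
    process -- a pure-Python analog of running `jstack <pid>` repeatedly to
    spot where executor threads spend their time."""
    samples = []
    for _ in range(n_samples):
        snapshot = {tid: traceback.format_stack(frame)
                    for tid, frame in sys._current_frames().items()}
        samples.append(snapshot)
        time.sleep(interval)
    return samples
```

Frames that appear in most samples are where the time goes; that is the same reasoning applied to the JVM thread dumps discussed above.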

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a problem with the Parquet dependency. What version of Parquet are you building Spark 1.5 off of? (I'm not that familiar with Parquet issues myself, but hopefully a SQL person can chime in.) On Tue, Dec 15, 2015 at 3:23 PM,

Re: SparkML algos limitations question.

2015-12-15 Thread Joseph Bradley
. There is not an analogous limit for the GLMs you listed, but I'm not very familiar with the perceptron implementation. Joseph On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <evgeny.a.moro...@gmail.com > wrote: > Hello! > > I'm currently working on POC and try to use Random Forest

Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
You can do grid search if you set the evaluator to a MulticlassClassificationEvaluator, which expects a prediction column, not a rawPrediction column. There's a JIRA for making BinaryClassificationEvaluator accept prediction instead of rawPrediction. Joseph On Tue, Dec 1, 2015 at 5:10 AM

Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+. On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar wrote: > > Hi folks, > > Does anyone know whether the Grid Search capability is enabled since the > issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol > column doesn't exist" when

Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi, Could you please submit this via JIRA as a bug report? It will be very helpful if you include the Spark version, system details, and other info too. Thanks! Joseph On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > *Issue:* > >

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about """ 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing the benefits to many tree based

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
or (3): We use Breeze, but we have to modify it in order to do distributed optimization based on Spark. Joseph On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu <javeli...@gmail.com> wrote: > Hi everyone, > > I'm curious about the difference between > ml.class

Re: Serializing MLlib MatrixFactorizationModel

2015-08-17 Thread Joseph Bradley
I'd recommend using the built-in save and load, which will be better for cross-version compatibility. You should be able to call myModel.save(path), and load it back with MatrixFactorizationModel.load(path). On Mon, Aug 17, 2015 at 6:31 AM, Madawa Soysa madawa...@cse.mrt.ac.lk wrote: Hi All,
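The save/load pattern looks like the sketch below; a toy model stands in for MatrixFactorizationModel so the example runs anywhere (in MLlib the actual calls are model.save(sc, path) and MatrixFactorizationModel.load(sc, path)):

```python
import pickle

class ToyFactorizationModel:
    """Stand-in for MatrixFactorizationModel: just user/product factor dicts.
    Illustrates the save/load round trip, not MLlib's on-disk format."""
    def __init__(self, user_factors, product_factors):
        self.user_factors = user_factors
        self.product_factors = product_factors

    def save(self, path):  # mirrors model.save(sc, path)
        with open(path, "wb") as f:
            pickle.dump((self.user_factors, self.product_factors), f)

    @classmethod
    def load(cls, path):   # mirrors MatrixFactorizationModel.load(sc, path)
        with open(path, "rb") as f:
            return cls(*pickle.load(f))
```

The reason to prefer the built-in save/load over hand-rolled serialization is exactly the cross-version compatibility mentioned above: the library owns the format and can evolve it.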

Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
, Jul 25, 2015 at 8:07 AM, Joseph Bradley jos...@databricks.com wrote: I'd recommend starting with a few of the code examples to get a sense of Spark usage (in the examples/ folder when you check out the code). Then, you can work through the Spark methods they call, tracing as deep as needed

Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
an interesting (small) JIRA, examine the piece of code it mentions, and explore out from that initial entry point. That's how I mostly did it. Good luck! Joseph On Fri, Jul 24, 2015 at 10:48 AM, shashank kapoor shashank.prof...@gmail.com wrote: Hi guys, I am new to apache spark, I wanted

Error running sbt package on Windows 7 for Spark 1.3.1 and SimpleApp.scala

2015-06-04 Thread Joseph Washington
Hi all, I'm trying to run the standalone application SimpleApp.scala following the instructions on the http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala I was able to create a .jar file by doing sbt package. However when I tried to do $

Re: ALS Rating Object

2015-06-03 Thread Joseph Bradley
as a DeveloperApi to allow users to use Long instead of Int. We're also thinking about better ways to permit Long IDs. Joseph On Wed, Jun 3, 2015 at 5:04 AM, Yasemin Kaya godo...@gmail.com wrote: Hi, I want to use Spark's ALS in my project. I have the userid like 30011397223227125563254

Re: Restricting the number of iterations in Mllib Kmeans

2015-06-01 Thread Joseph Bradley
Hi Suman Meethu, Apologies---I was wrong about KMeans supporting an initial set of centroids! JIRA created: https://issues.apache.org/jira/browse/SPARK-8018 If you're interested in submitting a PR, please do! Thanks, Joseph On Mon, Jun 1, 2015 at 2:25 AM, MEETHU MATHEW meethu2...@yahoo.co.in

Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-30 Thread Joseph Bradley
This is really getting into an understanding of how optimization and GLMs work. I'd recommend reading some intro ML or stats literature on how Generalized Linear Models are estimated, as well as how convex optimization is used in ML. There are some free online texts as well as MOOCs which have
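The core of that estimation can be shown in a few lines: logistic regression fits its weights by taking gradient steps on the convex log loss. A minimal 1-D SGD sketch, illustrative only and nothing like MLlib's actual implementation:

```python
import math

def sgd_logistic(data, lr=0.1, epochs=100):
    """Minimal logistic regression via SGD on 1-D data: the weight is updated
    with the gradient of the log loss, the convex objective GLM fitting
    minimizes. Toy sketch, not LogisticRegressionWithSGD."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid prediction
            w -= lr * (p - y) * x                     # gradient of log loss w.r.t. w
            b -= lr * (p - y)                         # gradient w.r.t. bias
    return w, b
```

Step size (lr), iteration count, and feature scaling are exactly the knobs the SGD tuning questions in this thread are about.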

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-30 Thread Joseph Bradley
significant variables and deletes the others with a zero in the coefficients? What is a high lambda for you? Is the lambda a parameter available in Spark 1.4 only or can I see it in Spark 1.3? 2015-05-23 0:04 GMT+02:00 Joseph Bradley jos...@databricks.com: If you want to select specific

Re: Multilabel classification using logistic regression

2015-05-27 Thread Joseph Bradley
It looks like you are training each model i (for label i) by only using data with label i. You need to use all of your data to train each model so the models can compare each label i with the other labels (roughly speaking). However, what you're doing is multiclass (not multilabel)

Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread Joseph Bradley
to a build of the current Spark master (or can wait for 1.4), then the org.apache.spark.ml.classification.LogisticRegression implementation has been compared with R and should get very similar results. Good luck! Joseph On Wed, May 27, 2015 at 8:22 AM, SparknewUser melanie.galloi...@gmail.com wrote

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread Joseph Bradley
optimization to minimize a regularized log loss. Good luck! Joseph On Fri, May 22, 2015 at 1:07 PM, DB Tsai dbt...@dbtsai.com wrote: In Spark 1.4, Logistic Regression with elasticNet is implemented in ML pipeline framework. Model selection can be achieved through high lambda resulting lots of zero

Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Joseph Bradley
One more comment: That's a lot of categories for a feature. If it makes sense for your data, it will run faster if you can group the categories or split the 1895 categories into a few features which have fewer categories. On Wed, May 20, 2015 at 3:17 PM, Burak Yavuz brk...@gmail.com wrote:

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Joseph Bradley
Pipelines API will be a good option. It now has LogisticRegression which does not do feature scaling, and it uses LBFGS or OWLQN (depending on the regularization type) for optimization. It's also been compared with R in unit tests. Good luck! Joseph On Wed, May 20, 2015 at 3:42 PM, Xin Liu liuxin

Re: Getting the best parameter set back from CrossValidatorModel

2015-05-19 Thread Joseph Bradley
to Params and Pipelines, so this should become easier to use! Joseph On Sun, May 17, 2015 at 10:17 PM, Justin Yip yipjus...@prediction.io wrote: Thanks Ram. Your sample look is very helpful. (there is a minor bug that PipelineModel.stages is hidden under private[ml], just need a wrapper around

Re: Implementing custom metrics under MLPipeline's BinaryClassificationEvaluator

2015-05-18 Thread Joseph Bradley
Hi Justin, It sound like you're on the right track. The best way to write a custom Evaluator will probably be to modify an existing Evaluator as you described. It's best if you don't remove the other code, which handles parameter set/get and schema validation. Joseph On Sun, May 17, 2015

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread Joseph Bradley
instead of maxIterations, which is sort of a bug in the example). If that does not cap the max iterations, then please report it as a bug. To specify the initial centroids, you will need to modify the DenseKMeans example code. Please see the KMeans API docs for details. Good luck, Joseph On Mon

Re: Predict.scala using model for clustering In reference

2015-05-07 Thread Joseph Bradley
A KMeansModel was trained in the previous step, and it was saved to modelFile as a Java object file. This step is loading the model back and reconstructing the KMeansModel, which can then be used to classify new tweets into different clusters. Joseph On Thu, May 7, 2015 at 12:40 PM, anshu shukla

Re: Multilabel Classification in spark

2015-05-05 Thread Joseph Bradley
and RandomForest. Joseph On Tue, May 5, 2015 at 1:27 PM, DB Tsai dbt...@dbtsai.com wrote: LogisticRegression in MLlib package supports multilabel classification. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, May 5, 2015 at 1

Re: MLLib SVM probability

2015-05-04 Thread Joseph Bradley
/browse/SPARK-7015 I agree that naive one-vs-all reductions might not work that well, but that the raw scores could be calibrated using the scaling you mentioned, or other methods. Joseph On Mon, May 4, 2015 at 6:29 AM, Driesprong, Fokko fo...@driesprong.frl wrote: Hi Robert, I would say, taking

Re: [Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-26 Thread Joseph Bradley
a good way to set the buffer size automatically, though. Joseph On Fri, Apr 24, 2015 at 8:20 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi i have a next problem. I have a dataset with 30 columns (15 numeric, 15 categorical) and using ml transformers/estimators to transform each column

Re: Multiclass classification using Ml logisticRegression

2015-04-26 Thread Joseph Bradley
multiclass for now, until that JIRA gets completed. (If you're interested in doing it, let me know via the JIRA!) Joseph On Fri, Apr 24, 2015 at 3:26 AM, Selim Namsi selim.na...@gmail.com wrote: Hi, I just started using spark ML pipeline to implement a multiclass classifier using

Re: How can I retrieve item-pair after calculating similarity using RowMatrix

2015-04-25 Thread Joseph Bradley
It looks like your code is making 1 Row per item, which means that columnSimilarities will compute similarities between users. If you transpose the matrix (or construct it as the transpose), then columnSimilarities should do what you want, and it will return meaningful indices. Joseph On Fri
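columnSimilarities computes cosine similarity between columns, which is why the orientation of the matrix matters: rows are the entities, columns are what gets compared. A miniature version of the semantics in pure Python (not the RowMatrix API):

```python
import math

def column_similarities(matrix):
    """Cosine similarity between every pair of columns -- the semantics of
    RowMatrix.columnSimilarities in miniature. Transpose the input to switch
    between user-user and item-item similarity."""
    cols = list(zip(*matrix))

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    return {(i, j): cos(cols[i], cols[j])
            for i in range(len(cols)) for j in range(i + 1, len(cols))}
```

With users as rows and items as columns this yields item-item similarities; transposing the matrix (as the reply suggests) flips which pairing you get.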

Re: KMeans takeSample jobs and RDD cached

2015-04-25 Thread Joseph Bradley
, they are unpersisted (uncached) at the end. Joseph On Sat, Apr 25, 2015 at 6:36 AM, podioss grega...@hotmail.com wrote: Hi, i am running k-means algorithm with initialization mode set to random and various dataset sizes and values for clusters and i have a question regarding the takeSample job

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread Joseph Bradley
. Good luck! Joseph On Tue, Apr 21, 2015 at 4:58 AM, ayan guha guha.a...@gmail.com wrote: Hi I am getting an error Also, I am getting an error in mlib.ALS.train function when passing dataframe (do I need to convert the DF to RDD?) Code: training = ssc.sql(select userId,movieId,rating
