Product similarity with TF/IDF and Cosine similarity (DIMSUM)

2016-01-30 Thread Alan Prando
Hi Folks! I am trying to implement a Spark job to calculate the similarity between the products in my database, using only their names and descriptions. I would like to use TF-IDF to represent my text data and cosine similarity to calculate all similarities. My goal is, after the job completes, to get all similarities
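
For readers skimming the archive, the approach described in this post (TF-IDF vectors compared with cosine similarity) can be sketched in plain Python. This is only a minimal illustration under assumptions, not the poster's job and not Spark code: the `products` data, the whitespace tokenizer, and the smoothed IDF formula are all hypothetical choices, and it computes exact all-pairs similarities rather than the sampling-based DIMSUM approximation named in the subject.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical product texts (name + description); not taken from the thread.
products = {
    "p1": "red cotton t-shirt",
    "p2": "blue cotton t-shirt",
    "p3": "stainless steel kettle",
}


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return a sparse TF-IDF vector (term -> weight) for each document."""
    tokens = {key: tokenize(text) for key, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for words in tokens.values() for term in set(words))
    # Smoothed IDF, one common variant: log((n + 1) / (df + 1)) + 1.
    idf = {term: math.log((n_docs + 1) / (count + 1)) + 1.0
           for term, count in df.items()}
    return {
        key: {term: tf * idf[term] for term, tf in Counter(words).items()}
        for key, words in tokens.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(products)

# Exact all-pairs similarities; this is quadratic in the number of products.
similarities = {
    (a, b): cosine(vectors[a], vectors[b])
    for a, b in combinations(sorted(vectors), 2)
}
```

The quadratic all-pairs loop is the part that becomes expensive on a real catalog, which is the cost an approximate method such as DIMSUM is meant to reduce.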

Spark saveAsText file size

2014-11-24 Thread Alan Prando
Hi Folks! I'm running a Spark job on a cluster with 9 slaves and 1 master (250GB RAM, 32 cores each and 1TB of storage each). This job generates 1.200 TB of data in an RDD with 1200 partitions. When I call saveAsTextFile(hdfs://...), Spark creates 1200 files named part-000* in the HDFS folder.

MLIB KMeans Exception

2014-11-20 Thread Alan Prando
Hi Folks! I'm running a Python Spark job on a cluster with 1 master and 10 slaves (64G RAM and 32 cores per machine). This job reads a 1.2-terabyte file with 1128201847 lines from HDFS and calls the KMeans method as follows: # SLAVE CODE - Reading features from HDFS def

Re: Spark on YARN

2014-11-19 Thread Alan Prando
Owen so...@cloudera.com: My guess is you're asking for all cores of all machines but the driver needs at least one core, so one executor is unable to find a machine to fit on. On Nov 18, 2014 7:04 PM, Alan Prando a...@scanboo.com.br wrote: Hi Folks! I'm running Spark on YARN cluster installed
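
The guess in this reply can be illustrated with a little arithmetic. The sketch below is a toy model only, not YARN's actual scheduler: it ignores memory and any other container overhead, and the function name and the 32-cores-per-executor request are assumptions made for illustration. The node count and core count come from the original question in this thread (3 slaves, 32 cores each).

```python
def executors_that_fit(nodes, cores_per_node, cores_per_executor, driver_cores=1):
    """Count executors that fit when one node gives up cores to the driver.

    Toy model (assumption): the driver occupies `driver_cores` on the first
    node, and every executor must fit entirely on a single node.
    """
    fitted = 0
    for index in range(nodes):
        free = cores_per_node - (driver_cores if index == 0 else 0)
        fitted += free // cores_per_executor
    return fitted


# Asking for all 32 cores per executor leaves one node unable to host one.
print(executors_that_fit(3, 32, 32))  # prints 2

# Asking for one core fewer lets an executor fit next to the driver.
print(executors_that_fit(3, 32, 31))  # prints 3
```

Under this simplified model, requesting slightly fewer cores per executor than a machine has is what lets all three slaves take part.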

Spark on YARN

2014-11-18 Thread Alan Prando
Hi Folks! I'm running Spark on a YARN cluster installed with Cloudera Manager Express. The cluster has 1 master and 3 slaves, each machine with 32 cores and 64G RAM. My Spark job is working fine; however, it seems that only 2 of the 3 slaves are working (htop shows 2 slaves working 100% on 32 cores,

Reading from Hbase using python

2014-11-12 Thread Alan Prando
Hi all, I'm trying to read an HBase table using this example from GitHub (https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py); however, I have two qualifiers in a column family. Ex.: ROW COLUMN+CELL row1 column=f1:1, timestamp=1401883411986,