RE: Hash Partitioning and Dataframes

2015-05-08 Thread Daniel, Ronald (ELS-SDG)
, May 08, 2015 3:15 PM To: Daniel, Ronald (ELS-SDG) Cc: user@spark.apache.org Subject: Re: Hash Partitioning and Dataframes What are you trying to accomplish? Internally Spark SQL will add Exchange operators to make sure that data is partitioned correctly for joins and aggregations. If you
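A minimal sketch (hypothetical table and column names, not from the thread) of where those Exchange operators become visible: join two DataFrames and print the physical plan.

    df1 = sqlContext.table("orders")       # hypothetical registered table
    df2 = sqlContext.table("customers")    # hypothetical registered table
    joined = df1.join(df2, df1.cust_id == df2.cust_id)
    joined.explain()  # the printed plan shows Exchange nodes where Spark SQL
                      # repartitions each side on the join key before the join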

RE: How to do sparse vector product in Spark?

2015-03-13 Thread Daniel, Ronald (ELS-SDG)
Any convenient tool to do this [sparse vector product] in Spark? Unfortunately, it seems that there are very few operations defined for sparse vectors. I needed to add some, and ended up converting them to (dense) numpy vectors and doing the addition on those. Best regards, Ron From: Xi
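A minimal sketch of that workaround, assuming pyspark.mllib vectors (toy values, not from the thread): convert each SparseVector to a dense numpy array, add them, and wrap the result back up if an mllib vector is needed.

    from pyspark.mllib.linalg import Vectors

    sv1 = Vectors.sparse(5, {0: 1.0, 3: 2.0})
    sv2 = Vectors.sparse(5, {3: 1.0, 4: 4.0})

    dense_sum = sv1.toArray() + sv2.toArray()   # plain numpy addition on dense copies
    result = Vectors.dense(dense_sum)           # back to an mllib vector if needed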

sparse vector operations in Python

2015-03-09 Thread Daniel, Ronald (ELS-SDG)
Hi, Sorry to ask this, but how do I compute the sum of 2 (or more) mllib SparseVectors in Python? Thanks, Ron

partitions, SQL tables, and Parquet I/O

2015-02-26 Thread Daniel, Ronald (ELS-SDG)
Short story: I want to write some parquet files so they are pre-partitioned by the same key. Then, when I read them back in, joining the two tables on that key should be about as fast as things can be done. Can I do that, and if so, how? I don't see how to control the partitioning of a SQL
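A hedged sketch of one way to lay the files out by a key, assuming a Spark version with the DataFrame reader/writer API (hypothetical paths and column name). Note that partitionBy controls the directory layout on disk; a join can still shuffle unless both tables are also bucketed on the key.

    # df is an existing DataFrame with a doc_id column (hypothetical)
    df.write.partitionBy("doc_id").parquet("s3n://bucket/sentences_by_doc")

    sentences = sqlContext.read.parquet("s3n://bucket/sentences_by_doc")
    annotations = sqlContext.read.parquet("s3n://bucket/annotations_by_doc")
    joined = sentences.join(annotations, "doc_id")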

RE: Using TF-IDF from MLlib

2014-11-21 Thread Daniel, Ronald (ELS-SDG)
Thanks for the info Andy. A big help. One thing - I think you can figure out which document is responsible for which vector without checking in more code. Start with a PairRDD of [doc_id, doc_string] for each document and split that into one RDD for each column. The values in the doc_string RDD
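A minimal sketch of that pairing idea in PySpark (toy data, not from the thread): keep the ids and the texts in step, run TF-IDF on the text side only, then zip the ids back on. This works because every step below is a map-style transformation that keeps partitions and element order aligned.

    from pyspark.mllib.feature import HashingTF, IDF

    pairs = sc.parallelize([("doc1", "spark makes rdds"),
                            ("doc2", "rdds hold data")])
    ids = pairs.keys()
    words = pairs.values().map(lambda s: s.split())

    tf = HashingTF().transform(words)
    tf.cache()
    tfidf = IDF().fit(tf).transform(tf)

    id_to_vector = ids.zip(tfidf)   # (doc_id, tf-idf vector) pairs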

Using TF-IDF from MLlib

2014-11-20 Thread Daniel, Ronald (ELS-SDG)
Hi all, I want to try the TF-IDF functionality in MLlib. I can feed it words and generate the tf and idf RDD[Vector]s, using the code below. But how do I get this back to words and their counts and tf-idf values for presentation? val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
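A hedged sketch of one way to read a term's weight back out, assuming pyspark.mllib's HashingTF/IDF (toy data): HashingTF maps a term to a fixed index, so hashing the term again recovers the position to look up in each tf-idf vector.

    from pyspark.mllib.feature import HashingTF, IDF

    docs = sc.parallelize([["logistic", "regression"], ["sparse", "regression"]])
    htf = HashingTF()
    tf = htf.transform(docs).cache()
    tfidf = IDF().fit(tf).transform(tf)

    idx = htf.indexOf("regression")                                   # hashed index of the term
    per_doc = tfidf.map(lambda v: float(v.toArray()[idx])).collect()  # its tf-idf weight per doc

Hash collisions can fold two terms onto the same index, so this is approximate; keeping your own term-to-index mapping avoids that.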

RE: filtering a SchemaRDD

2014-11-16 Thread Daniel, Ronald (ELS-SDG)
Indeed it did. Thanks! Ron From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, November 14, 2014 9:53 PM To: Daniel, Ronald (ELS-SDG) Cc: user@spark.apache.org Subject: Re: filtering a SchemaRDD If I use row[6] instead of row[text] I get what I am looking for. However

filtering a SchemaRDD

2014-11-14 Thread Daniel, Ronald (ELS-SDG)
Hi all, I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields, one of which holds the text for a sentence from a document. # Load sentence data table sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing') sentenceRDD.take(3) Out[20]: [Row(annotID=118,
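A minimal sketch of filtering those rows on the text field by name, which avoids hard-coding a position like row[6] (hypothetical filter term):

    sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing')
    matches = sentenceRDD.filter(lambda row: 'figure' in row.text)
    matches.take(3)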

Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Hi all, Assume I have read the lines of a text file into an RDD: textFile = sc.textFile("SomeArticle.txt") Also assume that the sentence breaks in SomeArticle.txt were done by machine and have some errors, such as the break at "Fig." in the sample text below. Index Text N...as shown
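A hedged sketch of one generic way to see each line next to its successor (assuming a Spark version with zipWithIndex), so suspect breaks such as the one after "Fig." can be spotted and merged: index the lines, shift the index by one, and join.

    lines = sc.textFile("SomeArticle.txt")
    indexed = lines.zipWithIndex().map(lambda kv: (kv[1], kv[0]))  # (i, line_i)
    shifted = indexed.map(lambda kv: (kv[0] - 1, kv[1]))           # (i-1, line_i)
    neighbors = indexed.join(shifted)   # (i, (line_i, line_i+1)); note this shuffles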

RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
-in-a-sorted-RDD-td12621.html#a12664 On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote: Hi all, Assume I have read the lines of a text file into an RDD: textFile = sc.textFile("SomeArticle.txt") Also assume that the sentence breaks

RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Thanks Xiangrui, that looks very helpful. Best regards, Ron -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Wednesday, September 03, 2014 1:19 PM To: Daniel, Ronald (ELS-SDG) Cc: Victor Tso-Guillen; user@spark.apache.org Subject: Re: Accessing neighboring

RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Just to point out that the benchmark you point to has Redshift running on HDD machines instead of SSD, and it is still faster than Shark in all but one case. Like Gary, I'm also interested in replacing something we have on Redshift with Spark SQL, as it will give me much greater capability to

RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
From: Gary Malouf [mailto:malouf.g...@gmail.com] Sent: Wednesday, August 06, 2014 1:17 PM To: Nicholas Chammas Cc: Daniel, Ronald (ELS-SDG); user@spark.apache.org Subject: Re: Regarding tooling/performance vs RedShift Also, regarding something like redshift not having MLlib built in, much

Column width limits?

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Assume I want to make a PairRDD whose keys are S3 URLs and whose values are Strings holding the contents of those (UTF-8) files, but NOT split into lines. Are there length limits on those files/Strings? 1 MB? 16 MB? 4 GB? 1 TB? Similarly, can such a thing be registered as a table so that I can
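A minimal sketch, assuming a bucket of such files (hypothetical path): sc.wholeTextFiles returns (path, whole-file contents) pairs without splitting on newlines. Each value is a single in-memory string, so the practical ceiling is executor memory and the JVM's roughly 2 GB limit on a single string rather than any fixed file-size setting.

    docs = sc.wholeTextFiles("s3n://bucket/articles/")
    docs.map(lambda kv: (kv[0], len(kv[1]))).take(3)   # peek at (url, length) pairs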