RE: Hash Partitioning and Dataframes

2015-05-08 Thread Daniel, Ronald (ELS-SDG)
Just trying to make sure that something I know in advance (the joins will 
always have an equality test on one specific field) is used to optimize the 
partitioning so the joins only use local data.

Thanks for the info.

Ron


From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, May 08, 2015 3:15 PM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: Hash Partitioning and Dataframes

What are you trying to accomplish?  Internally Spark SQL will add Exchange 
operators to make sure that data is partitioned correctly for joins and 
aggregations.  If you are going to do other RDD operations on the result of 
dataframe operations and you need to manually control the partitioning, call 
df.rdd and partition as you normally would.
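
A minimal PySpark sketch of the df.rdd route Michael describes, run in the pyspark shell (so sqlContext exists). The column name "key", the toy data, and the partition count are illustrative, not from the thread:

from pyspark.sql import Row

# Toy DataFrames sharing a join column named "key" (names and sizes are illustrative).
left_df  = sqlContext.createDataFrame([Row(key=i, a=str(i)) for i in range(10)])
right_df = sqlContext.createDataFrame([Row(key=i, b=i * i) for i in range(10)])

num_partitions = 8

# Drop to the underlying RDDs and hash-partition both sides on the join key.
left_pairs  = left_df.rdd.map(lambda row: (row.key, row)).partitionBy(num_partitions)
right_pairs = right_df.rdd.map(lambda row: (row.key, row)).partitionBy(num_partitions)

# Both RDDs now share the same hash partitioner on the join key,
# so the subsequent join should not need to re-shuffle either side.
joined = left_pairs.join(right_pairs)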

On Fri, May 8, 2015 at 2:47 PM, Daniel, Ronald (ELS-SDG) 
r.dan...@elsevier.com wrote:
Hi,

How can I ensure that a batch of DataFrames I make are all partitioned based on 
the value of one column common to them all?
For RDDs I would partitionBy a HashPartitioner, but I don't see that in the 
DataFrame API.
If I partition the RDDs that way, then do a toDF(), will the partitioning be 
preserved?

Thanks,
Ron



RE: How to do sparse vector product in Spark?

2015-03-13 Thread Daniel, Ronald (ELS-SDG)
 Any convenient tool to do this [sparse vector product] in Spark?

Unfortunately, it seems that there are very few operations defined for sparse
vectors. I needed to add some together, and ended up converting them to (dense)
numpy arrays and doing the addition on those.
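
For reference, a small sketch of that workaround with pyspark.mllib SparseVectors (toy values; the vectors must have the same length):

import numpy as np
from pyspark.mllib.linalg import Vectors

# Two sparse vectors of length 5 (toy values).
a = Vectors.sparse(5, {0: 1.0, 3: 2.0})
b = Vectors.sparse(5, {3: 4.0, 4: 5.0})

# Convert to dense numpy arrays, add, and (optionally) wrap back up as an MLlib vector.
summed = a.toArray() + b.toArray()              # array([1., 0., 0., 6., 5.])
summed_vec = Vectors.dense(summed)

# The same trick works for a dot product.
dot = float(np.dot(a.toArray(), b.toArray()))   # 8.0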

Best regards,
Ron


From: Xi Shen [mailto:davidshe...@gmail.com]
Sent: Friday, March 13, 2015 1:50 AM
To: user@spark.apache.org
Subject: How to do sparse vector product in Spark?

Hi,

I have two RDD[Vector]s; both Vectors are sparse and of the form:

(id, value)

id indicates the position of the value in the vector space. I want to apply a dot
product to two such RDD[Vector]s and get a scalar value. The non-existent values
are treated as zero.
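
Nothing in the thread spells this out, but if each vector is literally an RDD of (id, value) pairs, one possible sketch of a dot product that treats missing ids as zero (toy data; run in the pyspark shell so sc exists) is:

# Each "vector" as an RDD of (id, value) pairs; ids missing from either side count as zero.
v1 = sc.parallelize([(0, 1.0), (3, 2.0), (7, 4.0)])
v2 = sc.parallelize([(3, 5.0), (7, 0.5)])

# The inner join keeps only ids present in both vectors (anything else contributes 0),
# then multiply the paired values and add them up.
dot = (v1.join(v2)
         .map(lambda kv: kv[1][0] * kv[1][1])
         .sum())
# dot == 2.0 * 5.0 + 4.0 * 0.5 == 12.0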

Any convenient tool to do this in Spark?


Thanks,
David


sparse vector operations in Python

2015-03-09 Thread Daniel, Ronald (ELS-SDG)
Hi,

Sorry to ask this, but how do I compute the sum of 2 (or more) mllib 
SparseVectors in Python?

Thanks,
Ron


 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



partitions, SQL tables, and Parquet I/O

2015-02-26 Thread Daniel, Ronald (ELS-SDG)
Short story: I want to write some Parquet files so they are pre-partitioned by
the same key. Then, when I read them back in, joining the two tables on that key
should be about as fast as possible.
Can I do that, and if so, how? I don't see how to control the partitioning of a
SQL table, as opposed to PairRDDs.
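
No answer appears in the archive. As a hedged sketch only (PySpark shell; names, paths, and counts are illustrative; the exact write API varies by Spark version), the closest RDD-level recipe looks like the following, with the caveat that the partitioner is not recorded in the Parquet metadata, so Spark SQL may still shuffle when the tables are read back and joined:

from pyspark.sql import Row

NUM_PARTITIONS = 32  # illustrative

# Hash-partition both datasets by the join key before converting to DataFrames.
left  = sc.parallelize([(i, "left-%d" % i)  for i in range(100)]).partitionBy(NUM_PARTITIONS)
right = sc.parallelize([(i, "right-%d" % i) for i in range(100)]).partitionBy(NUM_PARTITIONS)

left_df  = left.map(lambda kv: Row(key=kv[0], payload=kv[1])).toDF()
right_df = right.map(lambda kv: Row(key=kv[0], payload=kv[1])).toDF()

# The files on disk then follow the RDD's hash partitioning of the key,
# but nothing in the Parquet metadata tells Spark SQL about that layout.
left_df.write.parquet("/tmp/left.parquet")
right_df.write.parquet("/tmp/right.parquet")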

Thanks,
Ron



RE: Using TF-IDF from MLlib

2014-11-21 Thread Daniel, Ronald (ELS-SDG)
Thanks for the info Andy. A big help.

One thing - I think you can figure out which document is responsible for which 
vector without checking in more code.
Start with a PairRDD of [doc_id, doc_string] for each document and split that 
into one RDD for each column.
The values in the doc_string RDD get split and turned into a Seq and fed to 
TFIDF.
You can take the resulting RDD[Vector]s and zip them with the doc_id RDD. 
Presto!
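
A rough PySpark sketch of that recipe (illustrative doc_ids and text; the Scala equivalent is analogous). zip() is safe here because doc_ids and tfidf both descend from the same parent RDD through map-style transformations, so partition counts and sizes line up:

from pyspark.mllib.feature import HashingTF, IDF

# (doc_id, doc_string) pairs -- toy data.
docs = sc.parallelize([(101, "spark makes rdds"),
                       (102, "tf idf weights terms"),
                       (103, "rdds are resilient")])

doc_ids = docs.map(lambda kv: kv[0])
tokens  = docs.map(lambda kv: kv[1].split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(tokens)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

# Pair each tf-idf vector back up with the document it came from.
by_doc = doc_ids.zip(tfidf)          # RDD of (doc_id, tfidf_vector)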

Best regards,
Ron





Using TF-IDF from MLlib

2014-11-20 Thread Daniel, Ronald (ELS-SDG)
Hi all,

I want to try the TF-IDF functionality in MLlib.
I can feed it words and generate the tf and tf-idf RDD[Vector]s using the code
below.
But how do I get this back to words and their counts and tf-idf values for 
presentation?


import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
val documents: RDD[Seq[String]] = sentsTmp.map(_.toString.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

It looks like I can get the indices of the terms using something like

J = wordListRDD.map(w => hashingTF.indexOf(w))

where wordListRDD is an RDD holding the distinct words from the sequence of words
used to come up with tf.
But how do I do the equivalent of

Counts = J.map(j => tf.counts(j))  ?
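
No direct answer appears in the thread. A rough PySpark sketch of one way to pull per-term values back out, assuming tf and hashingTF were built with the Python MLlib equivalents of the code above (query words are illustrative; going through toArray() densifies each vector, which is fine for a handful of lookups):

words = ["figure", "sensor", "landsat"]          # illustrative query terms
indices = [hashingTF.indexOf(w) for w in words]  # hashed column index for each word

# For every document vector, read off the value at each word's index.
counts_per_doc = tf.map(lambda v: [v.toArray()[i] for i in indices])
counts_per_doc.take(3)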

Thanks,
Ron



RE: filtering a SchemaRDD

2014-11-16 Thread Daniel, Ronald (ELS-SDG)
Indeed it did. Thanks!

Ron


From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, November 14, 2014 9:53 PM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: filtering a SchemaRDD


If I use row[6] instead of row["text"] I get what I am looking for. However,
finding the right numeric index could be a pain.

Can I access the fields in a Row of a SchemaRDD by name, so that I can map, 
filter, etc. without a trial and error process of finding the right int for the 
fieldname?

row.text should work.
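
For example, the filter from the original message becomes (sketch):

openProbsRDD = sentenceRDD.filter(lambda row: "remains unknown" in row.text)
openProbsRDD.take(5)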

More examples here: 
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#tab_python_2

Michael


filtering a SchemaRDD

2014-11-14 Thread Daniel, Ronald (ELS-SDG)
Hi all,

I have a SchemaRDD that Is loaded from a file. Each Row contains 7 fields, one 
of which holds the text for a sentence from a document.

  # Load sentence data table
  sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing')
  sentenceRDD.take(3)
Out[20]: [Row(annotID=118, annotSet=u'ge', annotType=u'sentence', 
endOffset=20194, pii=u'0094576587900440', startOffset=20062, text=u'Paper 
IAF-86-85 presented at the 37th Congress of the International Astronautical 
Federation, Innsbruck, Austria, 4-11 October 1986.'), Row(annotID=163, 
annotSet=u'ge', annotType=u'sentence', endOffset=20249, 
pii=u'0094576587900440', startOffset=20194, text=u"The landsat sensors: Eosat's
plans for landsats 6 and 7"), Row(annotID=190, annotSet=u'ge',
annotType=u'sentence', endOffset=20342, pii=u'0094576587900440', 
startOffset=20334, text=u'Abstract')]

I have this registered as a table and can query it with SQL select statements. I
would also like to filter the RDD using text operations like regexps that have
greater capabilities than SQL's LIKE operator. However, the code below does not
work. Instead I get a runtime error.

openProbsRDD = sentenceRDD.filter(lambda row: "remains unknown" in row["text"])
openProbsRDD.take(5)
...
TypeError: tuple indices must be integers, not str
...

If I use row[6] instead of row["text"] I get what I am looking for. However,
finding the right numeric index could be a pain.

Can I access the fields in a Row of a SchemaRDD by name, so that I can map, 
filter, etc. without a trial and error process of finding the right int for the 
fieldname?

Thanks,
Ron Daniel


Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Hi all,

Assume I have read the lines of a text file into an RDD:

textFile = sc.textFile("SomeArticle.txt")

Also assume that the sentence breaks in SomeArticle.txt were done by machine 
and have some errors, such as the break at Fig. in the sample text below.

Index   Text
N       ...as shown in Fig.
N+1     1.
N+2     The figure shows...

What I want is an RDD with:

N   ... as shown in Fig. 1.
N+1 The figure shows...

Is there some way a filter() can look at neighboring elements in an RDD? That 
way I could look, in parallel, at neighboring elements in an RDD and come up 
with a new RDD that may have a different number of elements.  Or do I just have 
to sequentially iterate through the RDD?

Thanks,
Ron




RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Thanks for the pointer to that thread. Looks like there is some demand for this 
capability, but not a lot yet. Also doesn't look like there is an easy answer 
right now.

Thanks,
Ron


From: Victor Tso-Guillen [mailto:v...@paxata.com]
Sent: Wednesday, September 03, 2014 10:40 AM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: Accessing neighboring elements in an RDD

Interestingly, there was an almost identical question posed on Aug 22 by 
cjwang. Here's the link to the archive: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664

On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) 
r.dan...@elsevier.com wrote:
Hi all,

Assume I have read the lines of a text file into an RDD:

textFile = sc.textFile("SomeArticle.txt")

Also assume that the sentence breaks in SomeArticle.txt were done by machine 
and have some errors, such as the break at Fig. in the sample text below.

Index   Text
N       ...as shown in Fig.
N+1     1.
N+2     The figure shows...

What I want is an RDD with:

N   ... as shown in Fig. 1.
N+1 The figure shows...

Is there some way a filter() can look at neighboring elements in an RDD? That 
way I could look, in parallel, at neighboring elements in an RDD and come up 
with a new RDD that may have a different number of elements.  Or do I just have 
to sequentially iterate through the RDD?

Thanks,
Ron




RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Thanks Xiangrui, that looks very helpful.

Best regards,
Ron


 -Original Message-
 From: Xiangrui Meng [mailto:men...@gmail.com]
 Sent: Wednesday, September 03, 2014 1:19 PM
 To: Daniel, Ronald (ELS-SDG)
 Cc: Victor Tso-Guillen; user@spark.apache.org
 Subject: Re: Accessing neighboring elements in an RDD
 
 There is a sliding method implemented in MLlib
 (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala),
 which is used in computing Area Under Curve:
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/AreaUnderCurve.scala#L45
 
 With it, you can process neighbor lines by
 
 rdd.sliding(3).map { case Seq(l0, l1, l2) => ... }
 
 -Xiangrui
 
 On Wed, Sep 3, 2014 at 11:30 AM, Daniel, Ronald (ELS-SDG)
 r.dan...@elsevier.com wrote:
  Thanks for the pointer to that thread. Looks like there is some demand
  for this capability, but not a lot yet. Also doesn't look like there
  is an easy answer right now.
 
 
 
  Thanks,
 
  Ron
 
 
 
 
 
  From: Victor Tso-Guillen [mailto:v...@paxata.com]
  Sent: Wednesday, September 03, 2014 10:40 AM
  To: Daniel, Ronald (ELS-SDG)
  Cc: user@spark.apache.org
  Subject: Re: Accessing neighboring elements in an RDD
 
 
 
  Interestingly, there was an almost identical question posed on Aug 22
  by cjwang. Here's the link to the archive:
  http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664
 
 
 
  On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG)
  r.dan...@elsevier.com wrote:
 
  Hi all,
 
  Assume I have read the lines of a text file into an RDD:
 
  textFile = sc.textFile("SomeArticle.txt")
 
  Also assume that the sentence breaks in SomeArticle.txt were done by
  machine and have some errors, such as the break at Fig. in the sample text
 below.
 
  Index   Text
  N       ...as shown in Fig.
  N+1     1.
  N+2     The figure shows...
 
  What I want is an RDD with:
 
  N   ... as shown in Fig. 1.
  N+1 The figure shows...
 
  Is there some way a filter() can look at neighboring elements in an RDD?
  That way I could look, in parallel, at neighboring elements in an RDD
  and come up with a new RDD that may have a different number of
  elements.  Or do I just have to sequentially iterate through the RDD?
 
  Thanks,
  Ron
 
 


RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Just to point out that the benchmark you point to has Redshift running on HDD 
machines instead of SSD, and it is still faster than Shark in all but one case.

Like Gary, I'm also interested in replacing something we have on Redshift with 
Spark SQL, as it will give me much greater capability to process things. I'm 
willing to sacrifice some performance for the greater capability. But it would 
be nice to see the benchmark updated with Spark SQL, and with a more 
competitive configuration of Redshift.

Best regards, and keep up the great work!

Ron


From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
Sent: Wednesday, August 06, 2014 9:30 AM
To: Gary Malouf
Cc: user
Subject: Re: Regarding tooling/performance vs RedShift

1) We get tooling out of the box from RedShift (specifically, stable JDBC 
access) - Spark we often are waiting for devops to get the right combo of tools 
working or for libraries to support sequence files.

The arguments about JDBC access and simpler setup definitely make sense. My 
first non-trivial Spark application was actually an ETL process that sliced and 
diced JSON + tabular data and then loaded it into Redshift. From there on you 
got all the benefits of your average C-store database, plus the added benefit 
of Amazon managing many annoying setup and admin details for your Redshift 
cluster.

One area I'm looking forward to seeing Spark SQL excel at is offering fast JDBC 
access to raw data--i.e. directly against S3 / HDFS; no ETL required. For 
easy and flexible data exploration, I don't think you can beat that with a 
C-store that you have to ETL stuff into.

2) There is a belief that for many of our queries (assumed to often be joins) a 
columnar database will perform orders of magnitude better.

This is definitely an "it depends" statement, but there is a detailed benchmark
here (https://amplab.cs.berkeley.edu/benchmark/) comparing Shark, Redshift, and
other systems. Have you seen it? Redshift does very well, but Shark is on par 
or better than it in most of the tests. Of course, going forward we'll want to 
see Spark SQL match this kind of performance, and that remains to be seen.

Nick


On Wed, Aug 6, 2014 at 12:06 PM, Gary Malouf malouf.g...@gmail.com wrote:
My company is leaning towards moving much of their analytics work from our own 
Spark/Mesos/HDFS/Cassandra set up to RedShift.  To date, I have been the 
internal advocate for using Spark for analytics, but a number of good points 
have been brought up to me.  The reasons being pushed are:

- RedShift exposes a jdbc interface out of the box (no devops work there) and 
data looks and feels like it is in a normal sql database.  They want this out 
of the box from Spark, no trying to figure out which version matches this 
version of Hive/Shark/SparkSQL etc.  Yes, the next release theoretically 
supports this but there have been release issues our team has battled to date 
that erode the trust.

- Complaints around challenges we have faced running a spark shell locally 
against a cluster in EC2.  It is partly a devops issue of deploying the correct 
configurations to local machines, being able to kick a user off hogging RAM, 
etc.

- "I want to be able to run queries from my python shell against your sequence
file data, roll it up and in the same shell leverage python graph tools." -
I'm not very familiar with the Python setup, but I believe by being able to run
locally AND somehow add custom libraries to be accessed from PySpark this could
be done.

- Joins will perform much better (in RedShift) because it says it sorts its
keys.  We cannot pre-compute all joins away.


Basically, their argument is two-fold:

1) We get tooling out of the box from RedShift (specifically, stable JDBC 
access) - Spark we often are waiting for devops to get the right combo of tools 
working or for libraries to support sequence files.

2) There is a belief that for many of our queries (assumed to often be joins) a 
columnar database will perform orders of magnitude better.



Anyway, a test is being set up to compare the two on the performance side, but
from a tools perspective it's hard to counter the issues that are brought up.



RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Well yes, MLlib-like routines or pretty much anything else could be run on the 
derived results, but you have to unload the results from Redshift and then load 
them into some other tool. So it's nicer to leave them in memory and operate on 
them there. Major architectural advantage to Spark.

Ron


From: Gary Malouf [mailto:malouf.g...@gmail.com]
Sent: Wednesday, August 06, 2014 1:17 PM
To: Nicholas Chammas
Cc: Daniel, Ronald (ELS-SDG); user@spark.apache.org
Subject: Re: Regarding tooling/performance vs RedShift


Also, regarding something like redshift not having MLlib built in, much of that 
could be done on the derived results.
On Aug 6, 2014 4:07 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote:
Mostly I was just objecting to "Redshift does very well, but Shark is on par
or better than it in most of the tests" when that was not how I read the
results, and Redshift was on HDDs.

My bad. You are correct; the only test Shark (mem) does better on is test #1 
Scan Query.

And indeed, it would be good to see an updated benchmark with Redshift running 
on SSDs.

Nick


Column width limits?

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Assume I want to make a PairRDD whose keys are S3 URLs and whose values are 
Strings holding the contents of those (UTF-8) files, but NOT split into lines. 
Are there length limits on those files/Strings? 1 MB? 16 MB? 4 GB? 1 TB?
Similarly, can such a thing be registered as a table so that I can use substr() 
to pick out pieces of the string?
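
No reply appears in the archive, but the PairRDD described in the first paragraph is what sc.wholeTextFiles returns; a minimal sketch (bucket and prefix are illustrative, and whether any practical record-size limits bite at multi-GB file sizes is left open):

# One record per file: key = full S3 path, value = the entire, unsplit file contents.
pairs = sc.wholeTextFiles("s3n://some-bucket/some-prefix/")

# e.g. peek at the first 80 characters of each document.
preview = pairs.mapValues(lambda text: text[:80]).take(3)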

Thanks,
Ron