Re: Design patterns for Spark implementation

2016-12-08 Thread Peter Figliozzi
Keep in mind that Spark is a parallel computing engine; it does not change your data infrastructure/data architecture. These days it's relatively convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...) and ditto on the output side. For example, for one of my use-cases, I sto
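A minimal sketch of that point, assuming a SparkSession `spark`, hypothetical paths, bucket, keyspace, and table names, and that the relevant connectors and credentials (e.g. hadoop-aws, the DataStax spark-cassandra-connector) are configured:

    // The same DataFrame API applies regardless of where the data lives
    val fromHdfs = spark.read.parquet("hdfs:///data/events/")
    val fromS3   = spark.read.json("s3a://my-bucket/events/")
    val fromCassandra = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()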

Re: Parsing XML

2016-10-04 Thread Peter Figliozzi
It's pretty clear that df.col(xpath) is looking for a column named xpath in your df, not executing an xpath over an XML document as you wish. Try constructing a UDF which applies your xpath query, and give that as the second argument to withColumn. On Tue, Oct 4, 2016 at 4:35 PM, Jean Georges Per
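A rough sketch of that suggestion, assuming a DataFrame `df` with an XML string column named "payload" and a hypothetical XPath expression; it uses the JDK's built-in javax.xml.xpath rather than any Spark-specific XML support:

    import java.io.StringReader
    import javax.xml.xpath.XPathFactory
    import org.xml.sax.InputSource
    import org.apache.spark.sql.functions.{col, udf}

    // Evaluate the XPath expression against each row's XML and return the text result
    val evalXPath = udf { (xml: String) =>
      XPathFactory.newInstance().newXPath()
        .evaluate("/book/title/text()", new InputSource(new StringReader(xml)))
    }
    val withTitle = df.withColumn("title", evalXPath(col("payload")))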

Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread Peter Figliozzi
It's a good question. People have been publishing papers on decision trees and various methods of constructing and pruning them for over 30 years. I think it's rather a question for a historian at this point. On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty wrote: > Read this explanation but

Re: Treating NaN fields in Spark

2016-09-29 Thread Peter Figliozzi
"isnan" ends up using a case class, subclass of UnaryExpression, called "IsNaN" which evaluates each row of the column like this: - *False* if the value is Null - Check the "Expression.Type" (apparently a Spark thing, not a Scala thing.. still learning here) - DoubleType: cast to Doub

Re: Treating NaN fields in Spark

2016-09-28 Thread Peter Figliozzi
In Scala, x.isNaN returns true for Double.NaN, but false for any character. I guess the `isnan` function you are using works by ultimately looking at x.isNaN. On Wed, Sep 28, 2016 at 5:56 AM, Mich Talebzadeh wrote: > > This is an issue in most databases. Specifically if a field is NaN.. --> ( >
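A quick REPL check of the first claim, for Double values:

    Double.NaN.isNaN   // true
    (42.0).isNaN       // false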

Re: read multiple files

2016-09-27 Thread Peter Figliozzi
If you're up for a fancy but excellent solution:
- Store your data in Cassandra.
- Use the expiring data feature (TTL) so data will automatically be removed a month later.
- Now in your Spark process, just read f
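A sketch of the read side under those assumptions, using the DataStax spark-cassandra-connector; the keyspace and table names are hypothetical, and the rows are assumed to have been written with a TTL (e.g. USING TTL 2592000 for 30 days):

    // Only rows whose TTL has not expired are still in Cassandra,
    // so this read naturally sees just the last month of data
    val events = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()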

median of groups

2016-09-26 Thread Peter Figliozzi
I'm trying to figure out a nice way to get the median of a DataFrame column *once it is grouped*. It's easy enough now to get the min, max, mean, and other things that are part of spark.sql.functions: df.groupBy("foo", "bar").agg(mean($"column1")) And it's easy enough to get the median of a co
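There is no median in spark.sql.functions, but one hedged workaround is the percentile_approx aggregate, reachable through expr() (on older Spark versions this goes through the Hive UDAF, so Hive support may be needed). Column names match the example above, and the result is an approximation, not an exact median:

    import org.apache.spark.sql.functions.expr

    val medians = df.groupBy("foo", "bar")
      .agg(expr("percentile_approx(column1, 0.5)").as("median_column1"))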

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
structure. Note that repartitioning > is an explicit shuffle. > > If you want to have only a single file you need to repartition the whole RDD > to a single partition. > Depending on the result data size it may be something that you want or do > not want to do ;-) > > Regards, >
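A sketch of that single-file suggestion: coalesce(1) shrinks the DataFrame to one partition before writing, so only one part file is produced (reasonable for small results, expensive for large ones). The output path is hypothetical:

    // Still written as a directory containing one part-* file plus _SUCCESS
    df.coalesce(1).write.csv("/path/to/foo-single")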

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
Spark will write the result files for each partition (in the > worker which holds it) and complete the operation by writing the _SUCCESS file in the driver node. > > Cheers, > Piotr > > > On Mon, Sep 26, 2016 at 4:56 AM, Peter Figliozzi > wrote: > >> Both >> >> df.
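In other words, the data is in the part-* files inside the output directory, and _SUCCESS is just an empty completion marker. A sketch of reading the whole directory back (same hypothetical path as the original question below):

    val roundTrip = spark.read.csv("/path/to/foo")   // picks up every part-* file
    roundTrip.show()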

Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Peter Figliozzi
Both df.write.csv("/path/to/foo") and df.write.format("com.databricks.spark.csv").save("/path/to/foo") result in a *blank* file called "_SUCCESS" under /path/to/foo. My df has stuff in it... tried this with both my real df and a quick df constructed from literals. Why isn't it writing anythi

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Peter Figliozzi
What would be interesting is to see how much > time each task/job/stage takes. > > On Thu, Sep 22, 2016 at 5:11 PM Peter Figliozzi > wrote: > >> It seems to me they must communicate for joins, sorts, grouping, and so >> forth, where the original data partitioning needs to cha

Re: Is executor computing time affected by network latency?

2016-09-22 Thread Peter Figliozzi
It seems to me they must communicate for joins, sorts, grouping, and so forth, where the original data partitioning needs to change. You could repeat your experiment for different code snippets. I'll bet it depends on what you do. On Thu, Sep 22, 2016 at 8:54 AM, gusiri wrote: > Hi, > > When I
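An illustrative sketch of operations that force that kind of communication (a shuffle); `df` and `otherDf` are hypothetical DataFrames sharing a "key" column:

    val grouped = df.groupBy("key").count()   // aggregation: repartitions by key
    val joined  = df.join(otherDf, "key")     // join: shuffles both sides (unless broadcast)
    val sorted  = df.orderBy("key")           // global sort: range-partitioning shuffle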

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Peter Figliozzi
I'm sure there's another way to do it; I hope someone can show us. I couldn't figure out how to use `map` either. On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai) wrote: > Thanks, Peter. > It works! > > Why udf is needed? > > > > > On Wed,

Re: Similar Items

2016-09-20 Thread Peter Figliozzi
Related question: is there anything that does scalable matrix multiplication on Spark? For example, we have that long list of vectors and want to construct the similarity matrix: v * T(v). In R it would be: v %*% t(v) Thanks, Pete On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott wrote: > Hi a
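One hedged option for the distributed product itself: mllib's BlockMatrix supports multiply, so v %*% t(v) can be formed from an RDD of indexed vectors. `indexedVectors` (an RDD[(Long, org.apache.spark.mllib.linalg.Vector)]) is hypothetical:

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    val mat = new IndexedRowMatrix(indexedVectors.map { case (i, v) => IndexedRow(i, v) })
      .toBlockMatrix()
    // Distributed matrix product v * t(v); for similarities specifically,
    // RowMatrix.columnSimilarities is another option
    val similarity = mat.multiply(mat.transpose)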

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-20 Thread Peter Figliozzi
> classes) are supported by importing spark.implicits._ Support for > serializing other types will be added in future releases. > dataStr.map(row => Vectors.parse(row.getString(1))) > > > Dose anyone can help me, > thanks very much! > > > > > > >
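One hedged way around the missing-Encoder error quoted above: mllib's Vector carries a SQL user-defined type, so wrapping Vectors.parse in a UDF avoids needing an Encoder at all. `dataStr` and its string column name are hypothetical:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.functions.{col, udf}

    // Parse each "[1.0,2.0,...]"-style string into an mllib Vector column
    val parseVector = udf { (s: String) => Vectors.parse(s) }
    val withVectors = dataStr.withColumn("features", parseVector(col("featuresStr")))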

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
der? > > > On Thu, Sep 8, 2016 at 11:26 AM, Peter Figliozzi > wrote: > >> All (three) of them. It's kind of cool-- when I re-run collect() a different >> executor will show up as first to encounter the error. >> >> On Wed, Sep 7, 2016 at 8:20 PM, ayan gu

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
All (three) of them. It's kind of cool-- when I re-run collect() a different executor will show up as first to encounter the error. On Wed, Sep 7, 2016 at 8:20 PM, ayan guha wrote: > Hi > > Is it happening on all executors or one? > > On Thu, Sep 8, 2016 at 10:46 AM, Pete

Fwd: distribute work (files)

2016-09-07 Thread Peter Figliozzi
FileNotFoundException? > > > Please paste the stacktrace here. > > > Yong > > > -- > *From:* Peter Figliozzi > *Sent:* Wednesday, September 7, 2016 10:18 AM > *To:* ayan guha > *Cc:* Lydia Ickler; user.spark > *Subject:* Re: distribu

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
with a wildcard. Thanks, Pete On Tue, Sep 6, 2016 at 11:20 PM, ayan guha wrote: > To access local file, try with file:// URI. > > On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi > wrote: > >> This is a great question. Basically you don't have to worry about the >> detail
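A sketch of the file:// suggestion quoted above; the path is hypothetical and must exist at the same location on every worker node:

    val local = sc.textFile("file:///home/user/data/part-*.csv")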

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-07 Thread Peter Figliozzi
apply and filter function. > > Do spark have some more detailed document? > > > > On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi > wrote: > >> Hi Yan, I think you'll have to map the features column to a new numerical >> features column. >> >> Her

Re: distribute work (files)

2016-09-06 Thread Peter Figliozzi
This is a great question. Basically you don't have to worry about the details-- just give a wildcard in your call to textFile. See the Programming Guide section entitled "External Datasets". The Spark framework will distribute your dat
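A minimal sketch of that wildcard approach, assuming a SparkContext `sc` and a hypothetical HDFS layout:

    val lines = sc.textFile("hdfs:///data/incoming/2016-09-*/*.txt")
    lines.count()   // partitions of the matched files are processed across the executors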

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-06 Thread Peter Figliozzi
Hi Yan, I think you'll have to map the features column to a new numerical features column. Here's one way to do the individual transform: scala> val x = "[1, 2, 3, 4, 5]" x: String = [1, 2, 3, 4, 5] scala> val y:Array[Int] = x slice(1, x.length - 1) replace(",", "") split(" ") map(_.toInt) y: Ar
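A hedged follow-up applying that same transform to a whole DataFrame column via a UDF (the DataFrame `df` and the column names are hypothetical):

    import org.apache.spark.sql.functions.{col, udf}

    // Strip the brackets, drop the commas, split on spaces, convert to Int
    val toIntArray = udf { (s: String) =>
      s.slice(1, s.length - 1).replace(",", "").split(" ").map(_.toInt)
    }
    val dfNumeric = df.withColumn("featuresNum", toIntArray(col("features")))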

Re: What do I lose if I run Spark without using HDFS or Zookeeper?

2016-08-25 Thread Peter Figliozzi
Spark is a parallel computing framework. There are many ways to give it data to chomp down on. If you don't know why you would need HDFS, then you don't need it. Same goes for Zookeeper. Spark works fine without either. Much of what we read online comes from people with specialized problems an