Re: Design patterns for Spark implementation

2016-12-08 Thread Peter Figliozzi
Keep in mind that Spark is a parallel computing engine; it does not change your data infrastructure or data architecture. These days it's relatively convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...) and likewise on the output side. For example, for one of my use-cases, I
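
A minimal sketch of what reading from a variety of sources looks like in practice; all paths and bucket names below are hypothetical, and the S3 read assumes the usual Hadoop S3 (s3a) support is on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-source-sketch").getOrCreate()

// CSV from HDFS -- path is hypothetical
val events = spark.read.option("header", "true").csv("hdfs:///data/events/*.csv")

// JSON from S3 via an s3a:// URI -- bucket and prefix are hypothetical
val logs = spark.read.json("s3a://my-bucket/logs/2016/12/")

// The output side is just as uniform
events.write.parquet("hdfs:///data/events_parquet")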

Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread Peter Figliozzi
It's a good question. People have been publishing papers on decision trees and various methods of constructing and pruning them for over 30 years. I think it's rather a question for a historian at this point. On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty wrote: >

Re: Treating NaN fields in Spark

2016-09-29 Thread Peter Figliozzi
"isnan" ends up using a case class, subclass of UnaryExpression, called "IsNaN" which evaluates each row of the column like this: - *False* if the value is Null - Check the "Expression.Type" (apparently a Spark thing, not a Scala thing.. still learning here) - DoubleType: cast to

Re: Treating NaN fields in Spark

2016-09-28 Thread Peter Figliozzi
In Scala, x.isNaN returns true for Double.NaN, but false for any character. I guess the `isnan` function you are using works by ultimately looking at x.isNaN. On Wed, Sep 28, 2016 at 5:56 AM, Mich Talebzadeh wrote: > > This is an issue in most databases. Specifically
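
A quick REPL-style check of that behavior (a minimal sketch):

// Double.NaN is the one value for which isNaN is true
Double.NaN.isNaN        // true
3.14.isNaN              // false

// A character or string is never NaN; it has to be parsed to a Double first,
// and a failed parse throws rather than producing NaN:
// "abc".toDouble       // NumberFormatException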

Re: read multiple files

2016-09-27 Thread Peter Figliozzi
If you're up for a fancy but excellent solution (sketched just below):
- Store your data in Cassandra.
- Use the expiring-data feature (TTL) so data will automatically be removed a month later.
- Now in your Spark process, just read
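
A hedged sketch of the Spark side of that setup; the keyspace and table names are hypothetical, the TTL itself is set when the data is written to Cassandra, and the read assumes the spark-cassandra-connector is on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ttl-read-sketch").getOrCreate()

// The TTL is applied on write in Cassandra, e.g. in CQL:
//   INSERT INTO ks.readings (id, ts, value) VALUES (...) USING TTL 2592000;
// (2592000 seconds = 30 days; expired rows simply disappear from reads)

// Spark then just reads whatever is currently live
val readings = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "readings"))
  .load()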

median of groups

2016-09-26 Thread Peter Figliozzi
I'm trying to figure out a nice way to get the median of a DataFrame column *once it is grouped*. It's easy enough now to get the min, max, mean, and other things that are part of spark.sql.functions: df.groupBy("foo", "bar").agg(mean($"column1")). And it's easy enough to get the median of a
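
One hedged way to get an approximate per-group median, assuming a Spark/Hive setup where the percentile_approx SQL function is available (the df, grouping, and column names follow the example above):

import org.apache.spark.sql.functions.expr

// percentile_approx(col, 0.5) gives an approximate median within each group
val medians = df
  .groupBy("foo", "bar")
  .agg(expr("percentile_approx(column1, 0.5)").as("median_column1"))

medians.show()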

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
repartitioning is explicit shuffle. If you want to have only a single file you need to repartition the whole RDD to a single partition. Depending on the result data size it may be something that you want or do not want to do ;-) > Regards, Piotr
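
A hedged illustration of the single-file trade-off described in the quoted reply (df is the DataFrame from the original question; output paths are hypothetical):

// One partition means one part file, but also a single task doing all the writing,
// so this only makes sense for modestly sized results
df.repartition(1)
  .write
  .option("header", "true")
  .csv("/path/to/single_file_output")

// coalesce(1) gives the same one-file layout while avoiding a full shuffle
df.coalesce(1).write.csv("/path/to/other_output")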

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
Spark will write the result files for each partition (in the worker which holds it) and complete the operation by writing the _SUCCESS in the driver node. > Cheers, Piotr > On Mon, Sep 26, 2016 at 4:56 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote: >>

Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Peter Figliozzi
Both df.write.csv("/path/to/foo") and df.write.format("com.databricks.spark.csv").save("/path/to/foo") result in a *blank* file called "_SUCCESS" under /path/to/foo. My df has stuff in it... tried this with both my real df and a quick df constructed from literals. Why isn't it writing
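
For contrast, a sketch of what that same call normally produces when it works, and how the output is read back (paths are hypothetical; exact part-file names vary by Spark version):

df.write.csv("/path/to/foo")

// On a healthy run /path/to/foo is a directory, not a single file, holding something like:
//   _SUCCESS                 (zero-byte completion marker)
//   part-00000-... .csv      (one data file per partition)
//   part-00001-... .csv

// Reading the directory back collects all part files into one DataFrame
val roundTrip = spark.read.csv("/path/to/foo")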

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Peter Figliozzi
y. What would be interesting is to see how much time each task/job/stage takes. > On Thu, Sep 22, 2016 at 5:11 PM Peter Figliozzi <pete.figlio...@gmail.com> wrote: >> It seems to me they must communicate for joins, sorts, grouping, and so forth, where the original

Re: Is executor computing time affected by network latency?

2016-09-22 Thread Peter Figliozzi
It seems to me they must communicate for joins, sorts, grouping, and so forth, where the original data partitioning needs to change. You could repeat your experiment for different code snippets. I'll bet it depends on what you do. On Thu, Sep 22, 2016 at 8:54 AM, gusiri
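
A hedged sketch of that experiment: the same synthetic data run through a narrow transformation and a shuffle-heavy one, so the two jobs can be compared in the Spark UI (assumes a SparkSession named spark):

// Synthetic (key, value) data just for comparison
val data = spark.sparkContext.parallelize(1 to 1000000).map(i => (i % 100, i.toDouble))

// Narrow: a pure map, executors never exchange partitions
data.mapValues(_ * 2.0).count()

// Wide: grouping by key forces a shuffle, so inter-executor network latency shows up
data.reduceByKey(_ + _).count()

// Compare the two jobs' stage and task durations in the Spark UI (port 4040 by default)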

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Peter Figliozzi
I'm sure there's another way to do it; I hope someone can show us. I couldn't figure out how to use `map` either. On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: > Thanks, Peter. It works! Why is udf needed? > On Wed

Re: Similar Items

2016-09-20 Thread Peter Figliozzi
Related question: is there anything that does scalable matrix multiplication on Spark? For example, we have that long list of vectors and want to construct the similarity matrix: v * T(v). In R it would be: v %*% t(v) Thanks, Pete On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott
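
One hedged option, in case it helps: MLlib's distributed BlockMatrix can multiply a matrix by its transpose (the toy vectors below stand in for the real list, and the block sizes are arbitrary; assumes a SparkSession named spark):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Two toy rows standing in for the long list of vectors
val rows = spark.sparkContext.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))
))

// v * T(v): multiply the distributed matrix by its own transpose
val v = new IndexedRowMatrix(rows).toBlockMatrix(1024, 1024)
val similarity = v.multiply(v.transpose)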

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-20 Thread Peter Figliozzi
ng, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. > dataStr.map(row => Vectors.parse(row.getString(1))) > Can anyone help me? Thanks very much!

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
.textFiles to that folder? > On Thu, Sep 8, 2016 at 11:26 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote: >> All (three) of them. It's kind of cool -- when I re-run collect() a different executor will show up as first to encounter the erro

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) >> On Wed, Sep 7, 2016 at 9:50 AM, Yong

Fwd: distribute work (files)

2016-09-07 Thread Peter Figliozzi
FileNotFoundException? > Please paste the stacktrace here. > Yong > -- > From: Peter Figliozzi <pete.figlio...@gmail.com> Sent: Wednesday, September 7, 2016 10:18 AM To: ayan guha Cc: Lydia Ickler;

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
with a wildcard. Thanks, Pete. On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote: > To access a local file, try with a file:// URI. > On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote: >> This is a great question. Basica

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-07 Thread Peter Figliozzi
> use apply and filter function. > Does Spark have more detailed documentation? > On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi <pete.figlio...@gmail.com> wrote: >> Hi Yan, I think you'll have to map the features column to a new numerical

Re: distribute work (files)

2016-09-06 Thread Peter Figliozzi
This is a great question. Basically, you don't have to worry about the details -- just give a wildcard in your call to textFile. See the Programming Guide section entitled "External Datasets". The Spark framework will distribute your
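
A minimal sketch of the wildcard approach (paths are hypothetical; the file:// form mentioned later in this thread requires the files to be present at that path on every worker node):

// Wildcard over an HDFS directory; Spark splits the matching files across executors
val logs = spark.sparkContext.textFile("hdfs:///data/logs/2016-09-*.txt")

// Local files need a file:// URI and must exist on all worker nodes
val localLogs = spark.sparkContext.textFile("file:///shared/data/*.txt")

logs.count()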

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-06 Thread Peter Figliozzi
Hi Yan, I think you'll have to map the features column to a new numerical features column. Here's one way to do the individual transform:

scala> val x = "[1, 2, 3, 4, 5]"
x: String = [1, 2, 3, 4, 5]

scala> val y: Array[Int] = x slice(1, x.length - 1) replace(",", "") split(" ") map(_.toInt)
y:
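
And a hedged end-to-end sketch of applying that transform to a whole DataFrame column with a udf (the dfStr DataFrame and its "features" column are assumptions from the thread; input strings must look exactly like "[1, 2, 3, 4, 5]"):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Parse "[1, 2, 3, 4, 5]" into an ML Vector; assumes well-formed input
val toVec = udf { s: String =>
  Vectors.dense(s.slice(1, s.length - 1).replace(",", "").split(" ").map(_.toDouble))
}

// dfStr is assumed to have a string column named "features"
val dfVec = dfStr.withColumn("featuresVec", toVec(col("features")))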

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Peter Figliozzi
Spark is a parallel computing framework. There are many ways to give it data to chomp down on. If you don't know why you would need HDFS, then you don't need it. Same goes for Zookeeper. Spark works fine without either. Much of what we read online comes from people with specialized problems