Re: Going it alone.

2020-04-15 Thread Matt Smith
This is so entertaining.

1. Ask for help
2. Compare those you need help from to a lower order primate.
3. Claim you provided information you did not
4. Explain that providing any information would be "too revealing"
5. ???

Can't wait to hear what comes next, but please keep it up.  This is a
bright spot in my day.


On Tue, Apr 14, 2020 at 4:47 PM jane thorpe 
wrote:

> I did write a long email in response to you.
> But then I deleted it because I felt it would be too revealing.
>
>
>
>
>
> --
> On Tuesday, 14 April 2020 David Hesson  wrote:
>
> I want to know  if Spark is headed in my direction.
>
> You are implying  Spark could be.
>
>
> What direction are you headed in, exactly? I don't feel as if anything
> were implied when you were asked for use cases or what problem you are
> solving. You were asked to identify some use cases, of which you don't
> appear to have any.
>
> On Tue, Apr 14, 2020 at 4:49 PM jane thorpe 
> wrote:
>
> That's what  I want to know,  Use Cases.
> I am looking for  direction as I described and I want to know  if Spark is
> headed in my direction.
>
> You are implying  Spark could be.
>
> So tell me about the USE CASES and I'll do the rest.
> --
> On Tuesday, 14 April 2020 yeikel valdes  wrote:
> It depends on your use case. What are you trying to solve?
>
>
>  On Tue, 14 Apr 2020 15:36:50 -0400 * janethor...@aol.com.INVALID *
> wrote 
>
> Hi,
>
> I consider myself to be quite good in Software Development especially
> using frameworks.
>
> I like to get my hands  dirty. I have spent the last few months
> understanding modern frameworks and architectures.
>
> I am looking to invest my energy in a product where I don't have to
> rely on the monkeys which occupy this space we call software
> development.
>
> I have found one that meets my requirements.
>
> Would Apache Spark be a good tool for me, or do I need to be a member of a
> team to develop products using Apache Spark?
>
>
>
>
>
>


Grouping into Arrays

2016-10-24 Thread Matt Smith
I worked up the following for grouping a DataFrame by a key and aggregating
into arrays.  It works, but I think it is horrible.  Is there a better way,
especially one that does not require RDDs?  This is a common pattern for us:
we often want to explode JSON arrays, do something to enrich the data, then
collapse it back into a structure similar to the pre-exploded one, but with
the enriched data.  collect_list seems to be the pattern I am looking for,
but it only works with Hive and only with primitives.  Help?

thx.

  import org.apache.spark.sql.{DataFrame, Row}
  import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

  def groupToArray(df: DataFrame, groupByCols: Seq[String], arrayCol: String): DataFrame = {
    val sourceSchema = df.schema
    // The output column holds an array of the source column's type.
    val arrayField = StructField(arrayCol, ArrayType(sourceSchema(arrayCol).dataType))
    val groupByIndexes = groupByCols.map(colName => sourceSchema.fieldIndex(colName))
    val arrayIndex = sourceSchema.fieldIndex(arrayCol)
    val destSchema = StructType(
      groupByCols.map(colName => sourceSchema(colName)) :+ arrayField
    )
    val rowRdd = df
      .rdd
      .groupBy(r => groupByIndexes.map(r(_)))
      .map { case (_, rowsIter) =>
        // Take the key values from the first row of the group and collect the
        // array column from every row; toSeq so the ArrayType column gets a Seq.
        val rowValues = rowsIter.head.toSeq
        val arr = rowsIter.map(r => r(arrayIndex)).toSeq
        val keys = groupByIndexes.map(ndx => rowValues(ndx))
        Row.fromSeq(keys :+ arr)
      }
    df.sqlContext.createDataFrame(rowRdd, destSchema)
  }
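
For what it's worth, in Spark 2.0+ collect_list is implemented natively, so
it should no longer require Hive, and the whole RDD round-trip can be
dropped.  A sketch under that assumption (groupToArrayBuiltin is a
hypothetical name, not Spark API):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list}

// Sketch, assuming Spark 2.0+: group by the key columns and fold the
// remaining column into an array with the built-in collect_list aggregate.
def groupToArrayBuiltin(df: DataFrame, groupByCols: Seq[String], arrayCol: String): DataFrame =
  df.groupBy(groupByCols.map(col): _*)
    .agg(collect_list(col(arrayCol)).as(arrayCol))
```

Note collect_list drops null values and does not guarantee element order, so
the result may differ from the RDD version in those respects.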


Iterative mapWithState

2016-08-30 Thread Matt Smith
Is it possible to use mapWithState iteratively?

In other words, I would like to keep calling mapWithState with the output
from the last mapWithState until there is no output.  For a given minibatch
mapWithState could be called anywhere from 1..200ish times depending on the
input/current state.
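
Since a DStream transformation can't re-invoke itself within a batch, the
loop would presumably have to run on the driver.  A minimal sketch of that
fixed-point loop, with the stateful transformation modeled as a plain
hypothetical step function on a batch of values (all names here are
assumptions, not Spark API):

```scala
// Sketch: repeatedly apply a step function to its own output until it
// produces nothing, accumulating every intermediate output. In a streaming
// job `step` would be the mapWithState-like transformation; `maxIters`
// bounds the loop (the 1..200ish iterations mentioned above).
def iterateUntilEmpty[A](initial: Seq[A], step: Seq[A] => Seq[A], maxIters: Int = 200): Seq[A] = {
  var acc = Seq.empty[A]
  var current = initial
  var iters = 0
  while (current.nonEmpty && iters < maxIters) {
    acc = acc ++ current
    current = step(current)
    iters += 1
  }
  acc
}
```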


Spark Streaming batch sequence number

2016-08-29 Thread Matt Smith
Is it possible to get a sequence number for the current batch (i.e. the
first batch is 0, the second is 1, etc.)?
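
Spark Streaming doesn't expose a batch counter directly, but output
operations such as foreachRDD((rdd, time) => ...) do receive the batch Time,
so an index can be derived from it.  A sketch under that assumption
(firstBatchTimeMs would be captured from the first batch seen; plain
arithmetic, not Spark API):

```scala
// Sketch: derive a 0-based batch sequence number from the batch time.
// Assumes a fixed batch interval and that firstBatchTimeMs was recorded
// when the first batch arrived.
def batchIndex(batchTimeMs: Long, firstBatchTimeMs: Long, batchIntervalMs: Long): Long =
  (batchTimeMs - firstBatchTimeMs) / batchIntervalMs
```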