Re: Grouping runs of elements in an RDD

2015-07-02 Thread RJ Nowling
Thanks, Mohit.  It sounds like we're on the same page -- I used a similar
approach.

On Thu, Jul 2, 2015 at 12:27 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 If you are joining successive lines together based on a predicate, then
 you are doing a flatMap, not an aggregate. You are on the right track
 with a multi-pass solution. I had the same challenge when I needed a
 sliding window over an RDD (see below).

 [ I had suggested that the sliding window API be moved to spark-core. Not
 sure if that happened. ]

 - previous posts ---


 http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions
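
 For reference, a minimal sketch of using that sliding API (assuming the
 spark-mllib jar is on the classpath and sc is a SparkContext):

   import org.apache.spark.mllib.rdd.RDDFunctions._

   val rdd = sc.parallelize(1 to 10, 3)
   // windows of 2 consecutive elements, spanning partition boundaries
   val pairs = rdd.sliding(2)   // RDD[Array[Int]]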

  On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi mohitja...@gmail.com
  wrote:
 
 
  http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
 
  you can use the MLlib function or do the following (which is what I had
  done):

  - In the first pass over the data, using mapPartitionsWithIndex, gather the
  first item in each partition. You can use collect (or an aggregator) for this.
  “Key” them by the partition index. At the end, you will have a map of
  (partition index) -> first item.
  - In the second pass over the data, using mapPartitionsWithIndex again,
  look at two items at a time (or, in the general case, N items at a time; you
  can use Scala’s sliding iterator) and check the time difference (or any
  sliding-window computation). To this mapPartitions call, pass the map created
  in the previous step. You will need it to check the last item in each
  partition.

  If you can tolerate a few inaccuracies, then you can just do the second
  step. You will miss the “boundaries” of the partitions, but it might be
  acceptable for your use case.
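
  A rough sketch of those two passes (not the exact code I used; names and the
  record type are illustrative):

    import org.apache.spark.rdd.RDD

    // Pass 1: first element of every partition, keyed by partition index.
    // This map is tiny (one entry per partition), so it is fine to ship it
    // inside the closure of the second pass.
    def firstPerPartition[T](rdd: RDD[T]): Map[Int, T] =
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        if (iter.hasNext) Iterator((idx, iter.next())) else Iterator.empty
      }.collect().toMap

    // Pass 2: slide over each partition, appending the first element of the
    // *next* partition so the window ending at this partition's last element
    // can also be evaluated.
    def slidingPairs[T](rdd: RDD[T], heads: Map[Int, T]): RDD[(T, T)] =
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        val extended = heads.get(idx + 1) match {
          case Some(nextHead) => iter ++ Iterator(nextHead)
          case None           => iter
        }
        extended.sliding(2).withPartial(false).map { case Seq(a, b) => (a, b) }
      }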


 On Tue, Jun 30, 2015 at 12:21 PM, RJ Nowling rnowl...@gmail.com wrote:

 That's an interesting idea!  I hadn't considered that.  However, looking
 at the Partitioner interface, I would need to make the decision from a single
 key, which doesn't fit my case, unfortunately.  For my case, I need to
 compare successive pairs of keys.  (I'm trying to re-join lines that were
 split prematurely.)
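
 (For reference, the Partitioner contract only ever sees one key at a time,
 roughly:

   // org.apache.spark.Partitioner, abridged
   abstract class Partitioner extends Serializable {
     def numPartitions: Int
     def getPartition(key: Any): Int  // the decision must come from this key alone
   }

 so there is no way to look at neighboring keys from inside it.)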

 On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh 
 abhis...@tetrationanalytics.com wrote:

 Could you use a custom partitioner to preserve boundaries such that all
 related tuples end up on the same partition?

 On Jun 30, 2015, at 12:00 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks, Reynold.  I still need to handle incomplete groups that fall
 between partition boundaries. So, I need a two-pass approach. I came up
 with a somewhat hacky way to handle those using the partition indices and
 key-value pairs as a second pass after the first.

 OCaml's std library provides a function called group() that takes a
 break function that operates on pairs of successive elements.  It seems a
 similar approach could be used in Spark and would be more efficient than my
 approach with key-value pairs since you know the ordering of the partitions.

 Has this need been expressed by others?

 On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin r...@databricks.com
 wrote:

 Try mapPartitions, which gives you an iterator, and you can produce an
 iterator back.
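
 A minimal sketch of that idea for the run-grouping case (sameRun is a
 hypothetical predicate that decides whether two successive lines belong
 together; runs that straddle partition boundaries still need a second,
 stitching pass):

   import org.apache.spark.rdd.RDD

   def groupRunsPerPartition(lines: RDD[String])
                            (sameRun: (String, String) => Boolean): RDD[List[String]] =
     lines.mapPartitions { iter =>
       val buffered = iter.buffered
       new Iterator[List[String]] {
         def hasNext: Boolean = buffered.hasNext
         def next(): List[String] = {
           var cur = buffered.next()
           val run = List.newBuilder[String]
           run += cur
           // keep consuming while the next element continues the current run
           while (buffered.hasNext && sameRun(cur, buffered.head)) {
             cur = buffered.next()
             run += cur
           }
           run.result()
         }
       }
     }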


 On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com
 wrote:

 Hi all,

 I have a problem where I have an RDD of elements:

 Item1 Item2 Item3 Item4 Item5 Item6 ...

 and I want to run a function over them to decide which runs of
 elements to group together:

 [Item1 Item2] [Item3] [Item4 Item5 Item6] ...

 Technically, I could use aggregate to do this, but I would have to use
 a List of Lists of T, which would produce a very large collection in memory.

 Is there an easy way to accomplish this?  E.g., it would be nice to
 have a version of aggregate where the combination function can return a
 complete group that is added to the new RDD and an incomplete group which
 is passed to the next call of the reduce function.

 Thanks,
 RJ









Grouping runs of elements in an RDD

2015-06-30 Thread RJ Nowling
Hi all,

I have a problem where I have an RDD of elements:

Item1 Item2 Item3 Item4 Item5 Item6 ...

and I want to run a function over them to decide which runs of elements to
group together:

[Item1 Item2] [Item3] [Item4 Item5 Item6] ...

Technically, I could use aggregate to do this, but I would have to use a
List of Lists of T, which would produce a very large collection in memory.

Is there an easy way to accomplish this?  E.g., it would be nice to have a
version of aggregate where the combination function can return a complete
group that is added to the new RDD and an incomplete group which is passed
to the next call of the reduce function.

Thanks,
RJ


Re: Grouping runs of elements in an RDD

2015-06-30 Thread RJ Nowling
That's an interesting idea!  I hadn't considered that.  However, looking at
the Partitioner interface, I would need to make the decision from a single
key, which doesn't fit my case, unfortunately.  For my case, I need to
compare successive pairs of keys.  (I'm trying to re-join lines that were
split prematurely.)

On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh 
abhis...@tetrationanalytics.com wrote:

 Could you use a custom partitioner to preserve boundaries such that all
 related tuples end up on the same partition?

 On Jun 30, 2015, at 12:00 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks, Reynold.  I still need to handle incomplete groups that fall
 between partition boundaries. So, I need a two-pass approach. I came up
 with a somewhat hacky way to handle those using the partition indices and
 key-value pairs as a second pass after the first.

 OCaml's std library provides a function called group() that takes a break
 function that operates on pairs of successive elements.  It seems a
 similar approach could be used in Spark and would be more efficient than my
 approach with key-value pairs since you know the ordering of the partitions.

 Has this need been expressed by others?

 On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin r...@databricks.com wrote:

 Try mapPartitions, which gives you an iterator, and you can produce an
 iterator back.


 On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 I have a problem where I have an RDD of elements:

 Item1 Item2 Item3 Item4 Item5 Item6 ...

 and I want to run a function over them to decide which runs of elements
 to group together:

 [Item1 Item2] [Item3] [Item4 Item5 Item6] ...

 Technically, I could use aggregate to do this, but I would have to use a
 List of Lists of T, which would produce a very large collection in memory.

 Is there an easy way to accomplish this?  E.g., it would be nice to
 have a version of aggregate where the combination function can return a
 complete group that is added to the new RDD and an incomplete group which
 is passed to the next call of the reduce function.

 Thanks,
 RJ







Re: Grouping runs of elements in an RDD

2015-06-30 Thread RJ Nowling
Thanks, Reynold.  I still need to handle incomplete groups that fall
between partition boundaries. So, I need a two-pass approach. I came up
with a somewhat hacky way to handle those using the partition indices and
key-value pairs as a second pass after the first.

OCaml's std library provides a function called group() that takes a break
function that operates on pairs of successive elements.  It seems a
similar approach could be used in Spark and would be more efficient than my
approach with key-value pairs since you know the ordering of the partitions.

Has this need been expressed by others?

On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin r...@databricks.com wrote:

 Try mapPartitions, which gives you an iterator, and you can produce an
 iterator back.


 On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 I have a problem where I have an RDD of elements:

 Item1 Item2 Item3 Item4 Item5 Item6 ...

 and I want to run a function over them to decide which runs of elements
 to group together:

 [Item1 Item2] [Item3] [Item4 Item5 Item6] ...

 Technically, I could use aggregate to do this, but I would have to use a
 List of Lists of T, which would produce a very large collection in memory.

 Is there an easy way to accomplish this?  E.g., it would be nice to have
 a version of aggregate where the combination function can return a complete
 group that is added to the new RDD and an incomplete group which is passed
 to the next call of the reduce function.

 Thanks,
 RJ





Best way to store RDD data?

2014-11-20 Thread RJ Nowling
Hi all,

I'm working on an application that has several tables (RDDs of tuples) of
data. Some of the types are complex-ish (e.g., date-time objects). I'd like
to use something like case classes for each entry.

What is the best way to store the data to disk in a text format without
writing custom parsers?  E.g., serializing case classes to/from JSON.
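
(For instance, a minimal sketch with json4s, which is one common choice; the
case class and paths are illustrative:)

  import org.json4s.NoTypeHints
  import org.json4s.jackson.Serialization

  case class Event(id: Long, name: String, timestamp: Long)

  // local round trip
  implicit val formats = Serialization.formats(NoTypeHints)
  val json: String = Serialization.write(Event(1L, "start", 1416500000000L))
  val event: Event = Serialization.read[Event](json)

  // for an RDD, build the Formats inside the closure and write one JSON
  // object per line:
  // rdd.mapPartitions { it =>
  //   implicit val f = Serialization.formats(NoTypeHints)
  //   it.map(Serialization.write(_))
  // }.saveAsTextFile("hdfs:///path/to/output")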

What are other users doing?

Thanks!
RJ


-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Multitenancy in Spark - within/across spark context

2014-10-25 Thread RJ Nowling
Ashwin,

What is your motivation for needing to share RDDs between jobs? Optimizing
for reusing data across jobs?

If so, you may want to look into Tachyon. My understanding is that Tachyon
acts like a caching layer and you can designate when data will be reused in
multiple jobs so it knows to keep it in memory or on local disk for faster
access. But my knowledge of Tachyon is secondhand, so forgive me if I have
it wrong :)

RJ

On Friday, October 24, 2014, Evan Chan velvia.git...@gmail.com wrote:

 Ashwin,

 I would say the strategies in general are:

 1) Have each user submit a separate Spark app (each with its own
 SparkContext), with its own resource settings, and share data through HDFS
 or something like Tachyon for speed.

 2) Share a single SparkContext amongst multiple users, using the fair
 scheduler.  This is sort of like having a Hadoop resource pool.  It
 has some obvious HA/SPOF issues, namely that if the context dies then
 every user using it is also dead.  Also, sharing RDDs in cached
 memory has the same resiliency problems, namely that if any executor
 dies then Spark must recompute / rebuild the RDD (it tries to only
 rebuild the missing part, but sometimes it must rebuild everything).
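
 (For 2, the fair scheduler is enabled with something like the following;
 the pool name and file path are illustrative:)

   val conf = new org.apache.spark.SparkConf()
     .set("spark.scheduler.mode", "FAIR")
     .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
   // threads submitting jobs can then target a pool:
   // sc.setLocalProperty("spark.scheduler.pool", "userA")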

 Job server can help with 1 or 2, 2 in particular.  If you have any
 questions about job server, feel free to ask at the spark-jobserver
 google group.   I am the maintainer.

 -Evan


 On Thu, Oct 23, 2014 at 1:06 PM, Marcelo Vanzin van...@cloudera.com wrote:
  You may want to take a look at
 https://issues.apache.org/jira/browse/SPARK-3174.
 
  On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:
  Upvote for the multitenancy requirement.
 
  I'm also building a data analytics platform, and there will be multiple users
  running queries and computations simultaneously. One of the pain points is
  control of resource size. Users don't really know how many nodes they need;
  they always use as much as possible... The result is a lot of wasted resources
  in our Yarn cluster.
 
  A way to 1) allow multiple Spark contexts to share the same resources or
  2) add dynamic resource management for Yarn mode is very much wanted.
 
  Jianshi
 
  On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin van...@cloudera.com wrote:
 
  On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
  ashwinshanka...@gmail.com wrote:
   That's not something you might want to do usually. In general, a
   SparkContext maps to a user application
  
   My question was basically this. In this page in the official doc, under
   the Scheduling within an application section, it talks about multiuser and
   fair sharing within an app. How does multiuser within an application
   work (how do users connect to an app and run their stuff)? When would I
   want to use this?
 
  I see. The way I read that page is that Spark supports all those
  scheduling options; but Spark doesn't give you the means to actually
  submit jobs from different users to a running SparkContext hosted in a
  different process. For that, you'll need something like the job server
  that I referenced before, or write your own framework for supporting
  that.
 
  Personally, I'd use the information on that page when dealing with
  concurrent jobs in the same SparkContext, but still restricted to the
  same user. I'd avoid trying to create any application where a single
  SparkContext is trying to be shared by multiple users in any way.
 
   As far as I understand, this will cause executors to be killed, which
   means that Spark will start retrying tasks to rebuild the data that
   was held by those executors when needed.
  
   I basically wanted to find out if there were any gotchas related to
   preemption on Spark. For example, say half of an application's executors
   got preempted while doing a reduceByKey; will the application progress
   with the remaining resources/fair share?
 
  Jobs should still make progress as long as at least one executor is
  available. The gotcha would be the one I mentioned, where Spark will
  fail your job after x executors failed, which might be a common
  occurrence when preemption is enabled. That being said, it's a
  configurable option, so you can set x to a very large value and your
  job should keep on chugging along.
 
  The options you'd want to take a look at are: spark.task.maxFailures
  and spark.yarn.max.executor.failures
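
  (These can be set via SparkConf or spark-defaults.conf; the values here are
  just illustrative:)

    val conf = new org.apache.spark.SparkConf()
      .set("spark.task.maxFailures", "100")            // tolerate many task failures
      .set("spark.yarn.max.executor.failures", "200")  // tolerate preempted executors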
 
  --
  Marcelo
 
 
 
 
 
  --
  Jianshi Huang
 
  LinkedIn: jianshi
  Twitter: @jshuang
  Github & Blog: http://huangjs.github.com/
 
 
 
  --
  Marcelo
 

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-19 Thread RJ Nowling
Jatin,

If you file the JIRA and don't want to work on it, I'd be happy to step in
and take a stab at it.

RJ

On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Jatin,

 HashingTF should be able to solve the memory problem if you use a
 small feature dimension in HashingTF. Please do not cache the input
 documents, but cache the output from HashingTF and IDF instead. We
 don't have a label indexer yet, so you need a label-to-index map to
 map labels to double values, e.g., D1 -> 0.0, D2 -> 1.0, etc. Assuming
 that the input is an RDD[(label: String, doc: Seq[String])], the code
 should look like the following:

 val docTypeToLabel = Map("D1" -> 0.0, ...)
 val tf = new HashingTF()
 // term frequencies, keyed by the numeric label
 val freqs = input.map(x => (docTypeToLabel(x._1), tf.transform(x._2))).cache()
 val idf = new IDF()
 val idfModel = idf.fit(freqs.values)
 // rescale the term frequencies by inverse document frequency
 val vectors = freqs.map(x => LabeledPoint(x._1, idfModel.transform(x._2)))
 val nbModel = NaiveBayes.train(vectors)

 IDF doesn't provide a filter on the minimum occurrence, but it would be nice
 to add that option. Please create a JIRA and someone may work on it.
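
 In the meantime, a rough sketch of pre-filtering terms that appear in fewer
 than two documents, before hashing (same RDD[(String, Seq[String])] input as
 above; sc is the SparkContext and names are illustrative):

   // document frequency of each term
   val docFreqs = input.flatMap { case (_, doc) => doc.distinct }
     .map(term => (term, 1))
     .reduceByKey(_ + _)

   // keep only terms seen in at least two documents
   // (the vocabulary is collected to the driver, so it must fit in memory)
   val keep = sc.broadcast(docFreqs.filter(_._2 >= 2).keys.collect().toSet)

   val filtered = input.mapValues(doc => doc.filter(t => keep.value.contains(t)))
   // then run HashingTF / IDF / NaiveBayes on `filtered` as above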

 Best,
 Xiangrui


 On Thu, Sep 18, 2014 at 3:46 AM, jatinpreet jatinpr...@gmail.com wrote:
  Hi,
 
  I have been running into memory overflow issues while creating TFIDF vectors
  to be used in document classification using MLlib's Naive Bayes
  classification implementation.
 
 
 http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
 
  Memory overflow and GC issues occur while collecting IDFs for all the terms.
  To give an idea of scale, I am reading around 615,000 (around 4GB of text
  data) small-sized documents from HBase and running the Spark program with 8
  cores and 6GB of executor memory. I have tried increasing the parallelism
  level and the shuffle memory fraction, but to no avail.
 
  The new TFIDF generation APIs caught my eye in the latest Spark version,
  1.1.0. The example given in the official documentation mentions creation of
  TFIDF vectors based on the hashing trick. I want to know if it will solve the
  mentioned problem through reduced memory consumption.
 
  Also, the example does not state how to create labeled points for a corpus
  of pre-classified document data. For example, my training input looks
  something like this:
 
  DocumentType  |  Content
  ----------------------------------------
  D1   |  This is Doc1 sample.
  D1   |  This also belongs to Doc1.
  D1   |  Yet another Doc1 sample.
  D2   |  Doc2 sample.
  D2   |  Sample content for Doc2.
  D3   |  The only sample for Doc3.
  D4   |  Doc4 sample looks like this.
  D4   |  This is Doc4 sample content.
 
  I want to create labeled points from this sample data for training. And once
  the Naive Bayes model is created, I generate TFIDFs for the test documents
  and predict the document type.
 
  If the new API can solve my issue, how can I generate labeled points using
  the new APIs? An example would be great.
 
  Also, I have a special requirement of ignoring terms that occur in fewer than
  two documents. This has important implications for the accuracy of my use
  case and needs to be accommodated while generating TFIDFs.
 
  Thanks,
  Jatin
 
 
 
  -
  Novice Big Data Programmer
 
 





-- 
em rnowl...@gmail.com
c 954.496.2314