Re: Grouping runs of elements in an RDD
Thanks, Mohit. It sounds like we're on the same page -- I used a similar approach.

On Thu, Jul 2, 2015 at 12:27 PM, Mohit Jaggi mohitja...@gmail.com wrote:

If you are joining successive lines together based on a predicate, then you are doing a flatMap, not an aggregate. You are on the right track with a multi-pass solution. I had the same challenge when I needed a sliding window over an RDD (see below). [I had suggested that the sliding window API be moved to spark-core. Not sure if that happened.]

--- previous posts ---
http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions

On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi mohitja...@gmail.com wrote:

http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E

You can use the MLlib function or do the following (which is what I had done):

- In a first pass over the data, using mapPartitionsWithIndex, gather the first item in each partition. You can use collect (or an aggregator) for this. Key them by the partition index. At the end, you will have a map: (partition index) -> (first item).
- In a second pass over the data, using mapPartitionsWithIndex again, look at two items at a time (or N items in the general case; you can use Scala's sliding iterator) and check the time difference (or any sliding-window computation). To this mapPartitions, pass the map created in the previous step. You will need it to check the last item in each partition.

If you can tolerate a few inaccuracies, then you can just do the second step. You will miss the "boundaries" of the partitions, but that might be acceptable for your use case.

On Tue, Jun 30, 2015 at 12:21 PM, RJ Nowling rnowl...@gmail.com wrote:

That's an interesting idea! I hadn't considered that. However, looking at the Partitioner interface, I would need to decide the partition from looking at a single key, which doesn't fit my case, unfortunately.
For my case, I need to compare successive pairs of keys. (I'm trying to re-join lines that were split prematurely.)

On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote:

Could you use a custom partitioner to preserve boundaries such that all related tuples end up on the same partition?

On Jun 30, 2015, at 12:00 PM, RJ Nowling rnowl...@gmail.com wrote:

Thanks, Reynold. I still need to handle incomplete groups that fall across partition boundaries, so I need a two-pass approach. I came up with a somewhat hacky way to handle those using the partition indices and key-value pairs as a second pass after the first.

OCaml's standard library provides a function called group() that takes a break function that operates on pairs of successive elements. It seems a similar approach could be used in Spark and would be more efficient than my approach with key-value pairs, since you know the ordering of the partitions. Has this need been expressed by others?

On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin r...@databricks.com wrote:

Try mapPartitions, which gives you an iterator, and you can produce an iterator back.

On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote:

Hi all, I have a problem where I have an RDD of elements:

Item1 Item2 Item3 Item4 Item5 Item6 ...

and I want to run a function over them to decide which runs of elements to group together:

[Item1 Item2] [Item3] [Item4 Item5 Item6] ...

Technically, I could use aggregate to do this, but I would have to use a List of List of T, which would produce a very large collection in memory. Is there an easy way to accomplish this? E.g., it would be nice to have a version of aggregate where the combination function can return a complete group that is added to the new RDD and an incomplete group which is passed to the next call of the reduce function.

Thanks, RJ
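Mohit's two-pass scheme can be sketched without Spark by modelling an RDD as a sequence of partitions. This is a minimal simulation of RJ's line-rejoining use case; the names (firstItems, rejoinPartition), the continuation predicate, and the simplifying assumption that at most one line crosses a partition boundary are all illustrative. In real Spark code, both passes would use mapPartitionsWithIndex, with the pass-1 result collected and broadcast.

```scala
// Model an RDD[String] as a Seq of partitions, each an ordered Seq of lines.
val partitions: Vector[Vector[String]] = Vector(
  Vector("Alpha line", "continued here"),
  Vector("still alpha", "Beta line"),
  Vector("Gamma line")
)

// Hypothetical predicate: a line continues the previous one if it starts
// in lowercase. The real predicate would be domain-specific.
def isContinuation(s: String): Boolean = s.headOption.exists(_.isLower)

// Pass 1: the first line of each partition, keyed by partition index.
// (In Spark: mapPartitionsWithIndex + collect, then broadcast the map.)
val firstItems: Map[Int, String] =
  partitions.zipWithIndex.collect { case (p, i) if p.nonEmpty => i -> p.head }.toMap

// Pass 2 (simplified: assumes at most the first line of the next partition
// continues a run; the general case must chase a chain of continuations).
def rejoinPartition(i: Int, part: Vector[String]): Vector[String] = {
  // Drop a leading continuation: the previous partition already claimed it.
  val lines =
    if (i > 0 && part.nonEmpty && isContinuation(part.head)) part.tail else part
  // Join runs of continuations within the partition.
  val grouped = lines.foldLeft(Vector.empty[String]) { (acc, line) =>
    if (acc.nonEmpty && isContinuation(line)) acc.init :+ (acc.last + " " + line)
    else acc :+ line
  }
  // Absorb the next partition's first line if it continues our last run.
  firstItems.get(i + 1) match {
    case Some(next) if grouped.nonEmpty && isContinuation(next) =>
      grouped.init :+ (grouped.last + " " + next)
    case _ => grouped
  }
}

val rejoined = partitions.zipWithIndex.flatMap { case (p, i) => rejoinPartition(i, p) }
// rejoined == Vector("Alpha line continued here still alpha", "Beta line", "Gamma line")
```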
Grouping runs of elements in an RDD
Hi all, I have a problem where I have an RDD of elements:

Item1 Item2 Item3 Item4 Item5 Item6 ...

and I want to run a function over them to decide which runs of elements to group together:

[Item1 Item2] [Item3] [Item4 Item5 Item6] ...

Technically, I could use aggregate to do this, but I would have to use a List of List of T, which would produce a very large collection in memory. Is there an easy way to accomplish this? E.g., it would be nice to have a version of aggregate where the combination function can return a complete group that is added to the new RDD and an incomplete group which is passed to the next call of the reduce function.

Thanks, RJ
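The within-partition part of this can be written as a lazy Iterator-to-Iterator transform, which is exactly the shape mapPartitions wants and avoids ever materializing a List of List. This is a sketch, not an existing Spark API; groupRuns and the break predicate are illustrative names, and groups that straddle partition boundaries still need separate handling.

```scala
// Lazily group consecutive elements: a new group starts whenever
// break(prev, curr) is true. Only one group is held in memory at a time.
def groupRuns[T](it: Iterator[T])(break: (T, T) => Boolean): Iterator[Vector[T]] =
  new Iterator[Vector[T]] {
    private val buffered = it.buffered
    def hasNext: Boolean = buffered.hasNext
    def next(): Vector[T] = {
      var group = Vector(buffered.next())
      while (buffered.hasNext && !break(group.last, buffered.head))
        group :+= buffered.next()
      group
    }
  }

// Example: group integers into runs of consecutive values.
val runs = groupRuns(Iterator(1, 2, 3, 10, 11, 20))((a, b) => b != a + 1).toVector
// runs == Vector(Vector(1, 2, 3), Vector(10, 11), Vector(20))
```

Inside Spark this could be applied per partition as rdd.mapPartitions(groupRuns(_)(break)), with a second pass to stitch together groups cut at partition boundaries.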
Re: Grouping runs of elements in an RDD
That's an interesting idea! I hadn't considered that. However, looking at the Partitioner interface, I would need to decide the partition from looking at a single key, which doesn't fit my case, unfortunately. For my case, I need to compare successive pairs of keys. (I'm trying to re-join lines that were split prematurely.)

On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote:

Could you use a custom partitioner to preserve boundaries such that all related tuples end up on the same partition?

On Jun 30, 2015, at 12:00 PM, RJ Nowling rnowl...@gmail.com wrote:

Thanks, Reynold. I still need to handle incomplete groups that fall across partition boundaries, so I need a two-pass approach. I came up with a somewhat hacky way to handle those using the partition indices and key-value pairs as a second pass after the first.

OCaml's standard library provides a function called group() that takes a break function that operates on pairs of successive elements. It seems a similar approach could be used in Spark and would be more efficient than my approach with key-value pairs, since you know the ordering of the partitions. Has this need been expressed by others?

On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin r...@databricks.com wrote:

Try mapPartitions, which gives you an iterator, and you can produce an iterator back.

On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote:

Hi all, I have a problem where I have an RDD of elements:

Item1 Item2 Item3 Item4 Item5 Item6 ...

and I want to run a function over them to decide which runs of elements to group together:

[Item1 Item2] [Item3] [Item4 Item5 Item6] ...

Technically, I could use aggregate to do this, but I would have to use a List of List of T, which would produce a very large collection in memory. Is there an easy way to accomplish this? E.g., it would be nice to have a version of aggregate where the combination function can return a complete group that is added to the new RDD and an incomplete group which is passed to the next call of the reduce function.

Thanks, RJ
Re: Grouping runs of elements in an RDD
Thanks, Reynold. I still need to handle incomplete groups that fall across partition boundaries, so I need a two-pass approach. I came up with a somewhat hacky way to handle those using the partition indices and key-value pairs as a second pass after the first.

OCaml's standard library provides a function called group() that takes a break function that operates on pairs of successive elements. It seems a similar approach could be used in Spark and would be more efficient than my approach with key-value pairs, since you know the ordering of the partitions. Has this need been expressed by others?

On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin r...@databricks.com wrote:

Try mapPartitions, which gives you an iterator, and you can produce an iterator back.

On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote:

Hi all, I have a problem where I have an RDD of elements:

Item1 Item2 Item3 Item4 Item5 Item6 ...

and I want to run a function over them to decide which runs of elements to group together:

[Item1 Item2] [Item3] [Item4 Item5 Item6] ...

Technically, I could use aggregate to do this, but I would have to use a List of List of T, which would produce a very large collection in memory. Is there an easy way to accomplish this? E.g., it would be nice to have a version of aggregate where the combination function can return a complete group that is added to the new RDD and an incomplete group which is passed to the next call of the reduce function.

Thanks, RJ
Best way to store RDD data?
Hi all,

I'm working on an application that has several tables (RDDs of tuples) of data. Some of the types are complex-ish (e.g., date-time objects). I'd like to use something like case classes for each entry.

What is the best way to store the data to disk in a text format without writing custom parsers? E.g., serializing case classes to/from JSON. What are other users doing?

Thanks! RJ

-- em rnowl...@gmail.com c 954.496.2314
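One common answer is one JSON object per line, which round-trips cleanly through saveAsTextFile / sc.textFile. Below is a minimal hand-rolled sketch of the idea; the Measurement case class and its field layout are made up, and in practice a JSON library (e.g., json4s or spray-json) should do the (de)serialization rather than the naive string building and regex shown here.

```scala
case class Measurement(id: Int, label: String, timestamp: String)

// Serialize one record per line (naive: assumes no quotes/escapes in fields).
def toJsonLine(m: Measurement): String =
  s"""{"id":${m.id},"label":"${m.label}","timestamp":"${m.timestamp}"}"""

// Naive parser for the exact format written above; a real application
// should use a JSON library instead of a regex.
val Line = """\{"id":(\d+),"label":"([^"]*)","timestamp":"([^"]*)"\}""".r

def fromJsonLine(s: String): Measurement = s match {
  case Line(id, label, ts) => Measurement(id.toInt, label, ts)
}

val m = Measurement(1, "temp", "2014-06-01T12:00:00Z")
val roundTripped = fromJsonLine(toJsonLine(m))
// roundTripped == m
```

With RDDs this becomes rdd.map(toJsonLine).saveAsTextFile(path) for writing and sc.textFile(path).map(fromJsonLine) for reading back.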
Re: Multitenancy in Spark - within/across spark context
Ashwin,

What is your motivation for needing to share RDDs between jobs? Optimizing for reusing data across jobs? If so, you may want to look into Tachyon. My understanding is that Tachyon acts like a caching layer: you can designate when data will be reused in multiple jobs so it knows to keep that data in memory or on local disk for faster access. But my knowledge of Tachyon is second hand, so forgive me if I have it wrong :)

RJ

On Friday, October 24, 2014, Evan Chan velvia.git...@gmail.com wrote:

Ashwin, I would say the strategies in general are:

1) Have each user submit a separate Spark app (each with its own SparkContext), with its own resource settings, and share data through HDFS or something like Tachyon for speed.

2) Share a single SparkContext amongst multiple users, using the fair scheduler. This is sort of like having a Hadoop resource pool. It has some obvious HA/SPOF issues, namely that if the context dies then every user using it is also dead. Also, sharing RDDs in cached memory has the same resiliency problems, namely that if any executor dies then Spark must recompute / rebuild the RDD (it tries to rebuild only the missing part, but sometimes it must rebuild everything).

Job server can help with 1 or 2, 2 in particular. If you have any questions about job server, feel free to ask at the spark-jobserver Google group. I am the maintainer.

-Evan

On Thu, Oct 23, 2014 at 1:06 PM, Marcelo Vanzin van...@cloudera.com wrote:

You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174.

On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:

Upvote for the multitenancy requirement. I'm also building a data analytics platform, and there'll be multiple users running queries and computations simultaneously. One of the pain points is control of resource size. Users don't really know how many nodes they need, so they always use as much as possible... The result is lots of wasted resources in our YARN cluster.
A way to 1) allow multiple Spark contexts to share the same resources or 2) add dynamic resource management for YARN mode is very much wanted.

Jianshi

On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin van...@cloudera.com wrote:

On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar ashwinshanka...@gmail.com wrote:

That's not something you might want to do usually. In general, a SparkContext maps to a user application.

My question was basically this. In this page in the official doc, under the "Scheduling within an application" section, it talks about multi-user and fair sharing within an app. How does multi-user within an application work (how do users connect to an app and run their stuff)? When would I want to use this?

I see. The way I read that page is that Spark supports all those scheduling options, but Spark doesn't give you the means to actually submit jobs from different users to a running SparkContext hosted in a different process. For that, you'll need something like the job server that I referenced before, or you'd have to write your own framework for supporting that.

Personally, I'd use the information on that page when dealing with concurrent jobs in the same SparkContext, but still restricted to the same user. I'd avoid trying to create any application where a single SparkContext is shared by multiple users in any way.

As far as I understand, this will cause executors to be killed, which means that Spark will start retrying tasks to rebuild the data that was held by those executors when needed.

I basically wanted to find out if there were any gotchas related to preemption on Spark. Things like: say half of an application's executors got preempted while doing reduceByKey; will the application progress with the remaining resources/fair share?

Jobs should still make progress as long as at least one executor is available.
The gotcha would be the one I mentioned, where Spark will fail your job after x executor failures, which might be a common occurrence when preemption is enabled. That being said, it's a configurable option, so you can set x to a very large value and your job should keep on chugging along. The options you'd want to take a look at are: spark.task.maxFailures and spark.yarn.max.executor.failures.

-- Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

-- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
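For a preemption-prone YARN queue, those two options could be raised in spark-defaults.conf; this is a sketch, and the numeric values below are arbitrary examples, not recommendations:

```
# spark-defaults.conf (values are illustrative only)
spark.task.maxFailures              16
spark.yarn.max.executor.failures    128
```

The same properties can equivalently be set per application via SparkConf.set or --conf on spark-submit.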
Re: New API for TFIDF generation in Spark 1.1.0
Jatin,

If you file the JIRA and don't want to work on it, I'd be happy to step in and take a stab at it.

RJ

On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:

Hi Jatin,

HashingTF should be able to solve the memory problem if you use a small feature dimension in HashingTF. Please do not cache the input documents, but cache the output from HashingTF and IDF instead. We don't have a label indexer yet, so you need a label-to-index map to map labels to double values, e.g., D1 -> 0.0, D2 -> 1.0, etc. Assuming that the input is an RDD[(label: String, doc: Seq[String])], the code should look like the following:

  val docTypeToLabel = Map("D1" -> 0.0, ...)
  val tf = new HashingTF()
  val freqs = input.map(x => (docTypeToLabel(x._1), tf.transform(x._2))).cache()
  val idf = new IDF()
  val idfModel = idf.fit(freqs.values)
  val vectors = freqs.map(x => LabeledPoint(x._1, idfModel.transform(x._2)))
  val nbModel = NaiveBayes.train(vectors)

IDF doesn't provide a filter on the minimum occurrence, but it would be nice to add that option. Please create a JIRA and someone may work on it.

Best, Xiangrui

On Thu, Sep 18, 2014 at 3:46 AM, jatinpreet jatinpr...@gmail.com wrote:

Hi,

I have been running into memory overflow issues while creating TFIDF vectors to be used in document classification using MLlib's Naive Bayes classification implementation.

http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/

Memory overflow and GC issues occur while collecting idfs for all the terms. To give an idea of scale, I am reading around 615,000 (around 4GB of text data) small-sized documents from HBase and running the Spark program with 8 cores and 6GB of executor memory. I have tried increasing the parallelism level and shuffle memory fraction, but to no avail.

The new TFIDF generation APIs caught my eye in the latest Spark version 1.1.0. The example given in the official documentation mentions creation of TFIDF vectors based on the hashing trick.
I want to know if it will solve the mentioned problem by benefiting from reduced memory consumption. Also, the example does not state how to create labeled points for a corpus of pre-classified document data. For example, my training input looks something like this:

  DocumentType | Content
  -------------|------------------------------
  D1           | This is Doc1 sample.
  D1           | This also belongs to Doc1.
  D1           | Yet another Doc1 sample.
  D2           | Doc2 sample.
  D2           | Sample content for Doc2.
  D3           | The only sample for Doc3.
  D4           | Doc4 sample looks like this.
  D4           | This is Doc4 sample content.

I want to create labeled points from this sample data for training. And once the Naive Bayes model is created, I generate TFIDFs for the test documents and predict the document type. If the new API can solve my issue, how can I generate labeled points using the new APIs? An example would be great.

Also, I have a special requirement of ignoring terms that occur in fewer than two documents. This has important implications for the accuracy of my use case and needs to be accommodated while generating TFIDFs.

Thanks, Jatin

- Novice Big Data Programmer

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/New-API-for-TFIDF-generation-in-Spark-1-1-0-tp14543.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-- em rnowl...@gmail.com c 954.496.2314
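Jatin's min-occurrence requirement (ignore terms appearing in fewer than two documents) can be pre-applied to the tokenized documents before HashingTF/IDF ever run. Here is the core document-frequency logic on plain Scala collections; all names are illustrative, and a Spark version would replace Seq with RDD and the groupBy with a reduceByKey.

```scala
// Tokenized documents (toy data for illustration).
val docs: Seq[Seq[String]] = Seq(
  Seq("spark", "rdd", "rdd"),
  Seq("spark", "mllib"),
  Seq("hbase")
)

// Document frequency: count each term at most once per document.
val df: Map[String, Int] = docs
  .flatMap(_.distinct)
  .groupBy(identity)
  .map { case (term, occs) => term -> occs.size }

// Keep only terms appearing in at least 2 documents, then filter each
// document's tokens before feeding them to HashingTF/IDF.
val vocabulary: Set[String] = df.collect { case (t, n) if n >= 2 => t }.toSet
val filteredDocs = docs.map(_.filter(vocabulary))
// filteredDocs == Seq(Seq("spark"), Seq("spark"), Seq())
```

Because the filtering happens before hashing, the resulting TF-IDF vectors simply never see the rare terms, which sidesteps the missing min-occurrence option in IDF.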