Re: Dataset -- Schema for type scala.collection.Set[scala.Int] is not supported

2019-08-09 Thread Mohit Jaggi
Switched to immutable.Set and it works. This is weird, as the code in ScalaReflection.scala seems to support scala.collection.Set. cc: dev list, in case this is a bug. On Thu, Aug 8, 2019 at 8:41 PM Mohit Jaggi wrote: > Is this not supported? I found this diff > <https://github.com/apa
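A minimal sketch of the working variant described above, assuming a plain spark-shell session:

    // works: the immutable Set picks up a supported encoder,
    // unlike the scala.collection.Set alias in the report below
    import spark.implicits._
    case class A(ps: scala.collection.immutable.Set[Int], x: Int)
    val az = Seq(A(Set(1, 2), 1), A(Set(2), 2))
    az.toDS.show()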

Dataset -- Schema for type scala.collection.Set[scala.Int] is not supported

2019-08-08 Thread Mohit Jaggi
Is this not supported? I found this diff and wonder if this is a bug or am I doing something wrong? See below:

    import scala.collection.Set
    case class A(ps: Set[Int], x: Int)
    val az = Seq(A(Set(1, 2), 1), A(Set(2), 2))
    az.toDS

Re: dataset best practice question

2019-01-18 Thread Mohit Jaggi
> ds_b = ds_a
>   .withColumn("f4", someUdf)
>   .withColumn("f5", someUdf)
>   .withColumn("f6", someUdf)
>   .as[B]
>
> Kevin
>
> From: Mohit Jaggi
> Sent: Tuesday, January 15, 2019 1:31 PM
> To: user
> Subjec

dataset best practice question

2019-01-15 Thread Mohit Jaggi
Fellow Spark Coders, I am trying to move from using DataFrames to Datasets for a reasonably large code base. Today the code looks like this:

    df_a = read_csv
    df_b = df_a.withColumn( some_transform_that_adds_more_columns )
    // repeat the above several times

With Datasets, this will require defining
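A rough sketch of the pattern suggested downthread; the case class names, someUdf, and the load step are illustrative placeholders, not the original code:

    import org.apache.spark.sql.Dataset
    import org.apache.spark.sql.functions.col
    import spark.implicits._

    // typed boundaries get case classes; the intermediate withColumn steps stay
    // untyped Columns and are cast back to a Dataset at the end with .as[B]
    case class A(f1: String, f2: Double, f3: Double)
    case class B(f1: String, f2: Double, f3: Double, f4: Double, f5: Double, f6: Double)

    val ds_a: Dataset[A] = ???                       // however A is loaded today
    val ds_b: Dataset[B] = ds_a
      .withColumn("f4", someUdf(col("f2")))          // someUdf is a placeholder UDF
      .withColumn("f5", someUdf(col("f2")))
      .withColumn("f6", someUdf(col("f3")))
      .as[B]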

Re: Pyspark access to scala/java libraries

2018-07-17 Thread Mohit Jaggi
ala-function-from-a-task >> Sent with ProtonMail Secure Email. >> --- Original Message --- >> On July 15, 2018 8:01 AM, Mohit Jaggi wrote: >> > Trying again…anyone know how to make this work? >> > > On Jul

Re: Pyspark access to scala/java libraries

2018-07-15 Thread Mohit Jaggi
Trying again…anyone know how to make this work? > On Jul 9, 2018, at 3:45 PM, Mohit Jaggi wrote: > > Folks, > I am writing some Scala/Java code and want it to be usable from pyspark. > > For example: > class MyStuff(addend: Int) { > def myMapFunction(x: Int) = x

Pyspark access to scala/java libraries

2018-07-09 Thread Mohit Jaggi
Folks, I am writing some Scala/Java code and want it to be usable from pyspark. For example:

    class MyStuff(addend: Int) {
      def myMapFunction(x: Int) = x + addend
    }

I want to call it from pyspark as:

    df = ...
    mystuff = sc._jvm.MyStuff(5)
    df['x'].map(lambda x: mystuff.myMapFunction(x))

Re: SparkILoop doesn't run

2016-11-28 Thread Mohit Jaggi
owhere in spark and can be removed without any issues. > On Thu, Nov 17, 2016 at 11:16 AM, Mohit Jaggi <mohitja...@gmail.com> wrote: >> Thanks Holden. I did post to the user list but since this is not a common >> case, I am trying the develop

SparkILoop doesn't run

2016-11-16 Thread Mohit Jaggi
doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Process finis

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-03 Thread Mohit Jaggi
For linear regression, it should be fairly easy. Just sort the coefficients :) Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com > On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote: > > Hi All, > > I am using SPARK and in partic
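A minimal sketch of that suggestion, assuming an mllib LinearRegressionWithSGD model, a trainingData RDD[LabeledPoint], and a parallel featureNames list; features should be standardized for the weights to be comparable:

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    // rank features by the magnitude of their fitted weights
    val model = LinearRegressionWithSGD.train(trainingData, 100)
    val ranked = featureNames.zip(model.weights.toArray)
      .sortBy { case (_, w) => -math.abs(w) }
    ranked.foreach { case (name, w) => println(f"$name%-20s $w%+10.4f") }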

using SparkILoop.run

2016-09-26 Thread Mohit Jaggi
I want to use the following API: SparkILoop.run(...). I am writing a test case that passes some scala code to the spark interpreter and receives the result as a string. I couldn't figure out how to pass the right settings into the run() method. I get an error about "master" not being set. object

Re: Model abstract class in spark ml

2016-08-31 Thread Mohit Jaggi
Thanks Cody. That was a good explanation! Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com > On Aug 31, 2016, at 7:32 AM, Cody Koeninger <c...@koeninger.org> wrote: > > http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/ > > On Wed, Au

Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
new AA(1) } Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com > On Aug 30, 2016, at 9:51 PM, Mohit Jaggi <mohitja...@gmail.com> wrote: > > thanks Sean. I am cross posting on dev to see why the code was written that > way. Perhaps, this.type doesn’t do what i

Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
thanks Sean. I am cross posting on dev to see why the code was written that way. Perhaps, this.type doesn’t do what is needed. Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com On Aug 30, 2016, at 2:08 PM, Sean Owen <so...@cloudera.com> wrote: I think it's imitating, for e

Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
Folks, I am having a bit of trouble understanding the following: abstract class Model[M <: Model[M]] Why is M <: Model[M]? Cheers, Mohit.
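An illustrative, non-Spark sketch of what the F-bounded type parameter buys: methods declared on the base class can return the concrete subclass type without casts.

    // M <: Model[M] lets the base class promise "this operation returns M",
    // so MyModel.copy() is typed as MyModel rather than Model[_]
    abstract class Model[M <: Model[M]] {
      def copy(): M
    }

    class MyModel(val uid: String) extends Model[MyModel] {
      def copy(): MyModel = new MyModel(uid)
    }

    val m: MyModel = new MyModel("m1").copy()   // no cast needed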

Re: How Spark HA works

2016-08-23 Thread Mohit Jaggi
sending a spark job and, if you used the right master config in your code, it should go to the new master. That will confirm that failover worked. Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com > On Aug 19, 2016, at 8:56 PM, Charles Nnamdi Akalugwu <cprenzb...@gmail.com>
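A hedged sketch of that check against a ZooKeeper-backed standalone cluster; host names are placeholders. Listing every master in the master URL lets the driver reconnect to whichever one is currently elected.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master1:7077,master2:7077")   // list all standby masters
      .setAppName("ha-failover-check")
    val sc = new SparkContext(conf)
    // if this still succeeds after killing the active master, failover worked
    println(sc.parallelize(1 to 1000).count())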

Re: Using spark to distribute jobs to standalone servers

2016-08-23 Thread Mohit Jaggi
It is a bit hacky but possible. A lot depends on what kind of queries etc you want to run. You could write a data source that reads your data and keeps it partitioned the way you want, then use mapPartitions() to execute your code… Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com
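A rough sketch of that idea; rdd, serverIdOf, numServers and runLegacyJob are hypothetical stand-ins for the existing data and per-server code:

    import org.apache.spark.HashPartitioner

    // keep each server's records together in one partition, then run the
    // existing single-machine code once per partition via mapPartitions
    val byServer = rdd.keyBy(record => serverIdOf(record))
      .partitionBy(new HashPartitioner(numServers))

    val results = byServer.mapPartitions { rows =>
      Iterator(runLegacyJob(rows.map(_._2)))   // one legacy-job run per partition
    }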

Re: Spark with Parquet

2016-08-23 Thread Mohit Jaggi
Something like this should work:

    val df = sparkSession.read.csv("myfile.csv") // you may have to provide a schema if the guessed schema is not accurate
    df.write.parquet("myfile.parquet")

Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com > On Apr 27, 2014, at 11:41 PM,
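If the guessed schema turns out wrong, a hedged variant with an explicit schema looks roughly like this (column names are illustrative):

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("amount", DoubleType, nullable = true)))

    val df = sparkSession.read.schema(schema).option("header", "true").csv("myfile.csv")
    df.write.mode("overwrite").parquet("myfile.parquet")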

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-12 Thread Mohit Jaggi
Looks like a bug in the code generating the SQL query, though why it would be specific to SAS I can't guess. Did you try the same with another database? As a workaround you can write the select statement yourself instead of just providing the table name. > On Jun 11, 2016, at 6:27 PM, Ajay Chander
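A sketch of that workaround with the generic JDBC source; the URL, credentials and column names are placeholders, and whether a given SAS JDBC driver accepts a subquery is not verified here:

    // hand Spark a subquery instead of a bare table name, so the SELECT that
    // reaches the database is the one you wrote yourself
    val df = spark.read.format("jdbc")
      .option("url", jdbcUrl)                               // placeholder connection URL
      .option("dbtable", "(select col1, col2 from my_table) t")
      .option("user", dbUser)                               // placeholder credentials
      .option("password", dbPassword)
      .load()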

Re: Calling Python code from Scala

2016-04-18 Thread Mohit Jaggi
When faced with this issue I followed the approach taken by pyspark and used py4j. You have to:
- ensure your code is Java compatible
- use py4j to call the Java (Scala) code from Python
> On Apr 18, 2016, at 10:29 AM, Holden Karau wrote: > > So if there is just a few
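A rough sketch of the Scala side of that approach. Per-element callbacks from Python into the JVM over py4j are very slow, so a common variant is to also expose a coarse-grained method that takes a whole DataFrame; the names below are illustrative, not an established API:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // plain class with Java-friendly types, reachable from pyspark as sc._jvm.MyStuff(5)
    // once the jar is on the driver/executor classpath
    class MyStuff(addend: Int) extends Serializable {
      def myMapFunction(x: Int): Int = x + addend

      // coarse-grained entry point: do the per-row work inside the JVM
      def addToColumn(df: DataFrame, colName: String): DataFrame =
        df.withColumn(colName, col(colName) + addend)
    }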

spark dataframe gc

2015-07-23 Thread Mohit Jaggi
Hi There, I am testing Spark DataFrame and haven't been able to get my code to finish due to what I suspect are GC issues. My guess is that GC interferes with heartbeating and executors are detected as failed. The data is ~50 numeric columns, ~100 million rows in a CSV file. We are doing a groupBy
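Not a fix for the root cause, but a hedged sketch of the knobs typically tried first for this symptom on 1.x-era releases (values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "12g")
      .set("spark.storage.memoryFraction", "0.4")   // 1.x property: leave more heap for aggregation
      .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails")
      .set("spark.akka.timeout", "300")             // 1.x property: tolerate longer GC pauses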

Re: Grouping runs of elements in a RDD

2015-07-02 Thread Mohit Jaggi
be moved to spark-core. not sure if that happened ] - previous posts --- http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi mohitja...@gmail.com wrote: http://mail-archives.apache.org/mod_mbox/spark

Re: Spark SQL v MemSQL/Voltdb

2015-05-28 Thread Mohit Jaggi
I have used VoltDB and Spark. The use cases for the two are quite different. VoltDB is intended for transactions and also supports queries on the same(custom to voltdb) store. Spark(SQL) is NOT suitable for transactions; it is designed for querying immutable data (which may exist in several

Re: Parsing CSV files in Spark

2015-02-06 Thread Mohit Jaggi
As Sean said, this is just a few lines of code. You can see an example here: https://github.com/AyasdiOpenSource/bigdf/blob/master/src/main/scala/com/ayasdi/bigdf/DF.scala#L660 On Feb 6, 2015,

Re: spark with cdh 5.2.1

2015-02-04 Thread Mohit Jaggi
profile. The hadoop-2.4 profile is for Hadoop 2.4 and beyond. You can set the particular version you want with -Dhadoop.version= You do not need to make any new profile to compile vs 2.5.0-cdh5.2.1. Again, the hadoop-2.4 profile is what you need. On Thu, Jan 29, 2015 at 11:33 PM, Mohit Jaggi

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread Mohit Jaggi
key and value and then using combine, however. — FG On Tue, Jan 27, 2015 at 10:17 PM, Mohit Jaggi mohitja...@gmail.com wrote: Hi All, I have a use case where I have an RDD (not a k,v pair) where I want to do a combineByKey() operation. I can do

spark with cdh 5.2.1

2015-01-29 Thread Mohit Jaggi
Hi All, I noticed in pom.xml that there is no entry for Hadoop 2.5. Has anyone tried Spark with 2.5.0-cdh5.2.1? Will replicating the 2.4 entry be sufficient to make this work? Mohit.

Re: spark challenge: zip with next???

2015-01-29 Thread Mohit Jaggi
http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E You can use the MLLib

RDD.combineBy

2015-01-27 Thread Mohit Jaggi
Hi All, I have a use case where I have an RDD (not a k,v pair) where I want to do a combineByKey() operation. I can do that by creating an intermediate RDD of k,v pairs and using PairRDDFunctions.combineByKey(). However, I believe it will be more efficient if I can avoid this intermediate RDD.
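For reference, a minimal sketch of the intermediate-RDD route under discussion; the pair RDD is lazy and fuses into the same stage, so the extra cost is mainly the per-element tuple allocation this thread is trying to avoid. Record and extractKey are hypothetical:

    import org.apache.spark.rdd.RDD

    // derive the key on the fly, then combine
    val pairs: RDD[(String, Record)] = rdd.map(r => (extractKey(r), r))
    val combined = pairs.combineByKey(
      (r: Record) => Vector(r),                         // createCombiner
      (acc: Vector[Record], r: Record) => acc :+ r,     // mergeValue
      (a: Vector[Record], b: Vector[Record]) => a ++ b  // mergeCombiners
    )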

Re: RDD Moving Average

2015-01-09 Thread Mohit Jaggi
Read this: http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
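A small sketch of the sliding-window technique from the linked thread, using mllib's RDDFunctions (a DeveloperApi):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val values = sc.parallelize((1 to 20).map(_.toDouble))
    val windowSize = 3
    // sliding(n) yields overlapping windows that also span partition boundaries
    val movingAvg = values.sliding(windowSize).map(w => w.sum / windowSize)
    movingAvg.collect().foreach(println)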

Re: Modifying an RDD in forEach

2014-12-06 Thread Mohit Jaggi
Ron, “appears to be working” might be true when there are no failures. On large datasets being processed on a large number of machines, failures of several types (server, network, disk etc.) can happen. At that time, Spark will not “know” that you changed the RDD in-place and will use any version

Re: Bug in Accumulators...

2014-11-22 Thread Mohit Jaggi
Perhaps the closure ends up including the main object, which is not defined as serializable... try making it a case object, or declare it as 'object Main extends Serializable'. On Sat, Nov 22, 2014 at 4:16 PM, lordjoe lordjoe2...@gmail.com wrote: I posted several examples in java at
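A small sketch of that suggestion; the accumulator call is the pre-2.0 API in use at the time of this thread:

    import org.apache.spark.SparkContext

    // making the enclosing object Serializable lets Spark ship the closure even
    // if it accidentally captures the outer object
    object Main extends Serializable {
      def run(sc: SparkContext): Long = {
        val acc = sc.accumulator(0L)
        sc.parallelize(1 to 100).foreach(x => acc += x.toLong)
        acc.value
      }
    }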

MEMORY_ONLY_SER question

2014-11-04 Thread Mohit Jaggi
Folks, If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed for a transformation/action later, is the whole partition of the RDD deserialized into Java objects first before my transform/action code works on it? Or is it deserialized in a streaming manner as the iterator moves

Re: how to run a dev spark project without fully rebuilding the fat jar ?

2014-10-22 Thread Mohit Jaggi
I think you can give a list of jars - not just one - to spark-submit, so rebuild only the one whose source code has changed. On Wed, Oct 22, 2014 at 10:29 PM, Yang tedd...@gmail.com wrote: during tests, I often modify my code a little bit and want to see the result. but spark-submit

Re: scala 2.11?

2014-09-16 Thread Mohit Jaggi
occur until the second half of November at the earliest. On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Scala 2.11 work is under way in open pull requests though, so hopefully it will be in soon. Matei On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja

scala 2.11?

2014-09-15 Thread Mohit Jaggi
Folks, I understand Spark SQL uses quasiquotes. Does that mean Spark has now moved to Scala 2.11? Mohit.

Re: File I/O in spark

2014-09-15 Thread Mohit Jaggi
Is this code running in an executor? You need to make sure the file is accessible on ALL executors. One way to do that is to use a distributed filesystem like HDFS or GlusterFS. On Mon, Sep 15, 2014 at 8:51 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi I am trying to perform some

Re: File I/O in spark

2014-09-15 Thread Mohit Jaggi
to it. On Mon, Sep 15, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com wrote: Is this code running in an executor? You need to make sure the file is accessible on ALL executors. One way to do that is to use a distributed filesystem like HDFS or GlusterFS. On Mon, Sep 15, 2014 at 8:51 AM, rapelly

Re: scala 2.11?

2014-09-15 Thread Mohit Jaggi
ah...thanks! On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote: No, not yet. Spark SQL is using org.scalamacros:quasiquotes_2.10. On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, I understand Spark SQL uses quasiquotes. Does

Re: File I/O in spark

2014-09-15 Thread Mohit Jaggi
...@gmail.com wrote: I came across these APIs in one the scala tutorials over the net. On Mon, Sep 15, 2014 at 10:14 PM, Mohit Jaggi mohitja...@gmail.com wrote: But the above APIs are not for HDFS. On Mon, Sep 15, 2014 at 9:40 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Yes. I have

Re: sc.textFile problem due to newlines within a CSV record

2014-09-13 Thread Mohit Jaggi
, 2014 at 7:43 PM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, I think this might be due to the default TextInputFormat in Hadoop. Any pointers to solutions much appreciated. More powerfully, you can define your own InputFormat implementations to format the input to your programs however

slides from df talk at global big data conference

2014-09-12 Thread Mohit Jaggi
http://engineering.ayasdi.com/2014/09/11/df-dataframes-on-spark/

Re: efficient zipping of lots of RDDs

2014-09-11 Thread Mohit Jaggi
filed jira SPARK-3489 https://issues.apache.org/jira/browse/SPARK-3489 On Thu, Sep 4, 2014 at 9:36 AM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, I sent an email announcing https://github.com/AyasdiOpenSource/df This dataframe is basically a map of RDDs of columns(along with DSL

pandas-like dataframe in spark

2014-09-04 Thread Mohit Jaggi
Folks, I have been working on a pandas-like dataframe DSL on top of spark. It is written in Scala and can be used from spark-shell. The APIs have the look and feel of pandas which is a wildly popular piece of software data scientists use. The goal is to let people familiar with pandas scale their

efficient zipping of lots of RDDs

2014-09-04 Thread Mohit Jaggi
Folks, I sent an email announcing https://github.com/AyasdiOpenSource/df This dataframe is basically a map of RDDs of columns(along with DSL sugar), as column based operations seem to be most common. But row operations are not uncommon. To get rows out of columns right now I zip the column RDDs

Re: advice sought on spark/cassandra input development - scala or python?

2014-09-04 Thread Mohit Jaggi
Johnny, Without knowing the domain of the problem it is hard to choose a programming language. I would suggest you ask yourself the following questions: - What if your project depends on a lot of python libraries that don't have Scala/Java counterparts? It is unlikely but possible. - What if

Re: Object serialisation inside closures

2014-09-04 Thread Mohit Jaggi
I faced the same problem and ended up using the same approach that Sean suggested https://github.com/AyasdiOpenSource/df/blob/master/src/main/scala/com/ayasdi/df/DF.scala#L313 Option 3 also seems reasonable. It should create a CSVParser per executor. On Thu, Sep 4, 2014 at 6:58 AM, Andrianasolo
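A small sketch of that per-partition variant, assuming lines is an RDD[String]; CSVParser stands for the opencsv parser used in the linked DF code, but any non-serializable parser works the same way:

    // build the parser on the executor, once per partition, instead of
    // capturing a driver-side instance in the closure
    val parsed = lines.mapPartitions { iter =>
      val parser = new au.com.bytecode.opencsv.CSVParser(',')
      iter.map(line => parser.parseLine(line))
    }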

Re: pandas-like dataframe in spark

2014-09-04 Thread Mohit Jaggi
that stuff like zipping two data frames might become harder, but the overall benefit in performance could be substantial. Matei On September 4, 2014 at 9:28:12 AM, Mohit Jaggi (mohitja...@gmail.com) wrote: Folks, I have been working on a pandas-like dataframe DSL on top of spark. It is written

why classTag not typeTag?

2014-08-22 Thread Mohit Jaggi
Folks, I am wondering why Spark uses ClassTag in RDD[T: ClassTag] instead of the more functional TypeTag option. I have some code that needs TypeTag functionality and I don't know if a typeTag can be converted to a classTag. Mohit.

kryo out of buffer exception

2014-08-16 Thread Mohit Jaggi
Hi All, I was doing a groupBy and apparently some keys were very frequent, making the serializer fail with a buffer overflow exception. I did not need a groupBy so I switched to combineByKey in this case, but would like to know how to increase the kryo buffer sizes to avoid this error. I hope there is
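A hedged sketch of the buffer settings; the property names changed across versions (1.x used the *.mb variants shown here, later releases use spark.kryoserializer.buffer / buffer.max with size suffixes):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "64")        // per-object buffer
      .set("spark.kryoserializer.buffer.max.mb", "512")   // cap hit by the overflow error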

closure issue - works in scalatest but not in spark-shell

2014-08-15 Thread Mohit Jaggi
Folks, I wrote the following wrapper on top on combineByKey. The RDD is of Array[Any] and I am extracting a field at a given index for combining. There are two ways in which I tried this: Option A: leave colIndex abstract in Aggregator class and define in derived object Aggtor with value -1. It

sparkcontext stop and then start again

2014-07-25 Thread Mohit Jaggi
Folks, I had some pyspark code which used to hang with no useful debug logs. It got fixed when I changed my code to keep the sparkcontext forever instead of stopping it and then creating another one later. Is this a bug or expected behavior? Mohit.

Re: pyspark sc.parallelize running OOM with smallish data

2014-07-14 Thread Mohit Jaggi
...@gmail.com wrote: I think this is probably dying on the driver itself, as you are probably materializing the whole dataset inside your python driver. How large is spark_data_array compared to your driver memory? On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi mohitja...@gmail.com wrote: I put

pyspark sc.parallelize running OOM with smallish data

2014-07-11 Thread Mohit Jaggi
spark_data_array here has about 35k rows with 4k columns. I have 4 nodes in the cluster and gave 48g to executors. Also tried kryo serialization.

    Traceback (most recent call last):
      File "/mohit/./m.py", line 58, in <module>
        spark_data = sc.parallelize(spark_data_array)
      File

Re: pyspark sc.parallelize running OOM with smallish data

2014-07-11 Thread Mohit Jaggi
) in scala. On Fri, Jul 11, 2014 at 2:00 PM, Mohit Jaggi mohitja...@gmail.com wrote: spark_data_array here has about 35k rows with 4k columns. I have 4 nodes in the cluster and gave 48g to executors. also tried kyro serialization. traceback (most recent call last): File /mohit/./m.py, line 58

Re: pyspark regression results way off

2014-06-25 Thread Mohit Jaggi
Is a python binding for LBFGS in the works? My co-worker has written one and can contribute back if it helps. On Mon, Jun 16, 2014 at 11:00 AM, DB Tsai dbt...@stanford.edu wrote: Is your data normalized? Sometimes, GD doesn't work well if the data has wide range. If you are willing to write

kibana like frontend for spark

2014-06-20 Thread Mohit Jaggi
Folks, I want to analyse logs and I want to use spark for that. However, elasticsearch has a fancy frontend in Kibana. Kibana's docs indicate that it works with elasticsearch only. Is there a similar frontend that can work with spark? Mohit. P.S.: On MapR's spark FAQ I read a statement like

Re: spark with docker: errors with akka, NAT?

2014-06-19 Thread Mohit Jaggi
, Jun 17, 2014 at 7:49 PM, Aaron Davidson ilike...@gmail.com wrote: Yup, alright, same solution then :) On Tue, Jun 17, 2014 at 7:39 PM, Mohit Jaggi mohitja...@gmail.com wrote: I used --privileged to start the container and then unmounted /etc/hosts. Then I created a new /etc/hosts file

Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Mohit Jaggi
I am using cutting edge code from git but doing my own sbt assembly. On Mon, Jun 16, 2014 at 10:28 PM, Andre Schumacher schum...@icsi.berkeley.edu wrote: Hi, are you using the amplab/spark-1.0.0 images from the global registry? Andre On 06/17/2014 01:36 AM, Mohit Jaggi wrote: Hi

Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Mohit Jaggi
to modify the /etc/hosts directly? I remember issues with that as docker apparently mounts it as part of its read-only filesystem. On Tue, Jun 17, 2014 at 4:36 PM, Mohit Jaggi mohitja...@gmail.com wrote: It was a DNS issue. AKKA apparently uses the hostname of the endpoints and hence they need

spark with docker: errors with akka, NAT?

2014-06-16 Thread Mohit Jaggi
Hi Folks, I am having trouble getting spark driver running in docker. If I run a pyspark example on my mac it works but the same example on a docker image (Via boot2docker) fails with following logs. I am pointing the spark driver (which is running the example) to a spark cluster (driver is not

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-22 Thread Mohit Jaggi
, May 21, 2014 at 2:35 PM, Mohit Jaggi mohitja...@gmail.com wrote: Hi, I changed my application to use Joda time instead of java.util.Date and I started getting this: WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1 time so far) What does this mean? How can I fix

Re: accessing partition i+1 from mapper of partition i

2014-05-22 Thread Mohit Jaggi
) = index - partition.reduce(math.max)}.collectAsMap() On Mon, May 19, 2014 at 9:50 PM, Mohit Jaggi mohitja...@gmail.com wrote: Thanks Brian. This works. I used Accumulable to do the collect in step B. While doing that I found that Accumulable.value is not a Spark action, I need to call cache

Re: filling missing values in a sequence

2014-05-20 Thread Mohit Jaggi
there and see whether it fits. -Xiangrui On Mon, May 19, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com wrote: Thanks Sean. Yes, your solution works :-) I did oversimplify my real problem, which has other parameters that go along with the sequence. On Fri, May 16, 2014 at 3:03 AM

Re: filling missing values in a sequence

2014-05-19 Thread Mohit Jaggi
: sc.parallelize(rdd1.first to rdd1.last) On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi mohitja...@gmail.com wrote: Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence. For example, (1, 2, 3, 5, 8, 11, ...) I need to fill in the missing

Re: life if an executor

2014-05-19 Thread Mohit Jaggi
I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice, however, if the RDD cache could be dissociated from the JVM heap so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a fresh one.

filling missing values in a sequence

2014-05-15 Thread Mohit Jaggi
Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence. For example, (1, 2, 3, 5, 8, 11, ...) I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11) One way to do this is to slide and zip rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11,

accessing partition i+1 from mapper of partition i

2014-05-14 Thread Mohit Jaggi
Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence. For example, (1, 2, 3, 5, 8, 11, ...) I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11) One way to do this is to slide and zip rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11,
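A rough sketch of the two-pass idea from the replies: collect each partition's first element, broadcast it, and let partition i borrow partition i+1's first element so every element can be paired with its successor before filling the gaps.

    val rdd = sc.parallelize(List(1, 2, 3, 5, 8, 11), 3)

    // pass A: first element of each partition, collected to the driver
    val firstOfPartition = rdd.mapPartitionsWithIndex { (i, it) =>
      if (it.hasNext) Iterator((i, it.next())) else Iterator.empty
    }.collectAsMap()
    val nextFirst = sc.broadcast(firstOfPartition)

    // pass B: pair each element with its successor, borrowing across the boundary
    val withNext = rdd.mapPartitionsWithIndex { (i, it) =>
      val borrowed = nextFirst.value.get(i + 1).iterator
      (it ++ borrowed).sliding(2).collect { case Seq(a, b) => (a, b) }
    }

    // fill the gaps; note this drops the very last element (11)
    val filled = withNext.flatMap { case (a, b) => a until b }
    // filled.collect() => 1, 2, 3, 4, 5, 6, 7, 8, 9, 10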

rdd ordering gets scrambled

2014-04-29 Thread Mohit Jaggi
Hi, I started with a text file(CSV) of sorted data (by first column), parsed it into Scala objects using map operation in Scala. Then I used more maps to add some extra info to the data and saved it as text file. The final text file is not sorted. What do I need to do to keep the order from the

Re: error in mllib lr example code

2014-04-24 Thread Mohit Jaggi
are updated in the master branch. You can also check the examples there. -Xiangrui On Wed, Apr 23, 2014 at 9:34 AM, Mohit Jaggi mohitja...@gmail.com wrote: sorry...added a subject now On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote: I am trying to run

spark mllib to jblas calls..and comparison with VW

2014-04-24 Thread Mohit Jaggi
Folks, I am wondering how mllib interacts with jblas and lapack. Does it make copies of data from my RDD format to jblas's format? Does jblas copy it again before passing to lapack native code? I also saw some comparisons with VW and it seems mllib is slower on a single node but scales better and

error in mllib lr example code

2014-04-23 Thread Mohit Jaggi
sorry...added a subject now On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote: I am trying to run the example linear regression code from http://spark.apache.org/docs/latest/mllib-guide.html But I am getting the following error...am I missing an import? code

scheduler question

2014-04-15 Thread Mohit Jaggi
Hi Folks, I have some questions about how the Spark scheduler works:
- How does Spark know how many resources a job might need?
- How does it fairly share resources between multiple jobs?
- Does it know about data and partition sizes and use that information for scheduling?
Mohit.