Re: N-Fold validation and RDD partitions

2014-03-24 Thread Holden Karau
There is also https://github.com/apache/spark/pull/18 against the current repo which may be easier to apply. On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh a...@adatao.com wrote: Hi Jaonary, You can find the code for k-fold CV in https://github.com/apache/incubator-spark/pull/448. I have

Re: ElasticSearch enrich

2014-06-25 Thread Holden Karau
, Holden Karau hol...@pigscanfly.ca wrote: So I'm giving a talk at the Spark Summit on using Spark and ElasticSearch, but for now if you want to see a simple demo which uses ElasticSearch for geo input you can take a look at my quick and dirty implementation with TopTweetsInALocation ( https

Re: ElasticSearch enrich

2014-06-26 Thread Holden Karau
must need to pull up hadoop programmatically? (if I can)) b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com On Thu, Jun 26, 2014 at 1:20 AM, Holden Karau hol

Re: ElasticSearch enrich

2014-06-26 Thread Holden Karau
-- Skype: boci13, Hangout: boci.b...@gmail.com On Thu, Jun 26, 2014 at 11:48 PM, Holden Karau hol...@pigscanfly.ca wrote: Hi b0c1, I have an example of how to do this in the repo for my talk as well

Re: ElasticSearch enrich

2014-06-27 Thread Holden Karau
ES index (so how can I set dynamic index in the foreachRDD) ? b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com On Fri, Jun 27, 2014 at 12:30 AM, Holden Karau

Re: ElasticSearch enrich

2014-06-27 Thread Holden Karau
run in IDEA, no magic trick. b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com On Fri, Jun 27, 2014 at 11:11 PM, Holden Karau hol...@pigscanfly.ca wrote: So

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Holden Karau
Me 3 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote: I also ran into same issue. What is the solution? -- View this message in context:

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Holden Karau
Currently scala 2.10.2 can't be pulled in from maven central it seems, however if you have it in your ivy cache it should work. On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote: Me 3 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote: I also ran

Re: MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Holden Karau
Hi Steve, The _ notation can be a bit confusing when starting with Scala; we can rewrite it to avoid using it here. So instead of val numUsers = ratings.map(_._2.user) we can write val numUsers = ratings.map(x => x._2.user). ratings is a key-value RDD (which is an RDD comprised of tuples) and so
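
A minimal sketch of the two equivalent forms, assuming ratings is an RDD[(Long, Rating)] keyed the way the MovieLensALS example keys it (the field names come from MLlib's Rating class):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.recommendation.Rating

    def distinctUsers(ratings: RDD[(Long, Rating)]): Long = {
      // Underscore shorthand: _ stands for the current tuple, ._2 picks out the Rating.
      val viaUnderscore = ratings.map(_._2.user)
      // The same projection written with an explicitly named parameter.
      val viaNamedParam = ratings.map(x => x._2.user)
      viaNamedParam.distinct().count()
    }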

Re: Spark Bug? job fails to run when given options on spark-submit (but starts and fails without)

2014-10-22 Thread Holden Karau
Hi Michael Campbell, Are you deploying against YARN or standalone mode? In YARN try setting the shell variable SPARK_EXECUTOR_MEMORY=2G; in standalone try setting SPARK_WORKER_MEMORY=2G. Cheers, Holden :) On Thu, Oct 16, 2014 at 2:22 PM, Michael Campbell michael.campb...@gmail.com wrote:

Re: saveasSequenceFile with codec and compression type

2014-10-22 Thread Holden Karau
Hi gpatcham, If you want to save as a sequence file with a custom compression type you can use saveAsHadoopFile along with setting the mapred.output.compression.type on the jobconf. If you want to keep using saveAsSequenceFile (and its syntax is much nicer), you could also set that property on
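
A hedged sketch of the second option (keeping saveAsSequenceFile and setting the property on the Hadoop configuration); BLOCK compression and GzipCodec are illustrative choices, not anything from the original thread:

    import org.apache.hadoop.io.SequenceFile
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def saveCompressed(sc: SparkContext, pairs: RDD[(String, String)], path: String): Unit = {
      // Set the sequence-file compression type on the underlying Hadoop configuration...
      sc.hadoopConfiguration.set("mapred.output.compression.type",
        SequenceFile.CompressionType.BLOCK.toString)
      // ...then keep the nicer saveAsSequenceFile syntax, with an explicit codec.
      pairs.saveAsSequenceFile(path, Some(classOf[GzipCodec]))
    }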

Re: version mismatch issue with spark breeze vector

2014-10-22 Thread Holden Karau
Hi Yang, It looks like your build file targets a different version of Spark than the one you are running against. I'd try building against the same version of Spark as you are running your application against (1.1.0). Also what is your assembly/shading configuration for your build? Cheers,

Re: OutOfMemory in cogroup

2014-10-27 Thread Holden Karau
On Monday, October 27, 2014, Shixiong Zhu zsxw...@gmail.com wrote: We encountered some special OOM cases of cogroup when the data in one partition is not balanced. 1. The estimated size of used memory is inaccurate. For example, there are too many values for some special keys. Because

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-27 Thread Holden Karau
Can you post the error message you get when trying to save the sequence file? If you call first() on the RDD does it result in the same error? On Mon, Oct 27, 2014 at 6:13 AM, buring qyqb...@gmail.com wrote: Hi: After updating Spark to version 1.1.0, I experienced a snappy error which

Re: exact count using rdd.count()?

2014-10-27 Thread Holden Karau
Hi Josh, The count() call will result in the correct number in each RDD; however, foreachRDD doesn't return the result of its computation anywhere (it's intended for things which cause side effects, like updating an accumulator or making a web request), so you might want to look at transform or the
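
A minimal sketch of the difference, assuming a hypothetical DStream of log lines: count() yields a DStream of per-batch counts, while foreachRDD only performs side effects and returns nothing:

    import org.apache.spark.streaming.dstream.DStream

    def logBatchSizes(lines: DStream[String]): Unit = {
      lines.count().print()                     // a DStream[Long] of per-batch counts
      lines.foreachRDD { rdd =>
        println(s"batch size: ${rdd.count()}")  // side effect on the driver; nothing is returned
      }
    }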

Re: Filtering URLs from incoming Internet traffic(Stream data). feasible with spark streaming?

2014-10-27 Thread Holden Karau
On Mon, Oct 27, 2014 at 9:15 PM, Nasir Khan nasirkhan.onl...@gmail.com wrote: According to my knowledge Spark Streaming uses mini batches for processing. Q: Is it a good idea to use my ML trained model on a web server for filtering purposes to classify URLs as obscene or benign? If spark

Re: Filtering URLs from incoming Internet traffic(Stream data). feasible with spark streaming?

2014-10-27 Thread Holden Karau
On Mon, Oct 27, 2014 at 10:19 PM, Nasir Khan nasirkhan.onl...@gmail.com wrote: I am kinda stuck with Spark now :/ I already proposed this model in my synopsis and it's already accepted :D Spark is a new thing for a lot of people. What alternate tool should I use now? You could use Spark to

Re: pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Holden Karau
Hi Csaba, It sounds like the API you are looking for is sc.wholeTextFiles :) Cheers, Holden :) On Tuesday, October 28, 2014, Csaba Ragany rag...@gmail.com wrote: Dear Spark Community, Is it possible to convert text files (.log or .txt files) into sequence files in Python? Using PySpark I

Re: what does DStream.union() do?

2014-10-29 Thread Holden Karau
The union function simply returns a DStream with the elements from both. This is the same behavior as when we call union on RDDs :) (You can think of union as similar to the union operator on sets except without the unique element restrictions). On Wed, Oct 29, 2014 at 3:15 PM, spr
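
A minimal sketch, assuming two hypothetical DStreams of the same element type:

    import org.apache.spark.streaming.dstream.DStream

    // union keeps every element from both streams, duplicates included, just like RDD.union.
    def combine(events: DStream[String], moreEvents: DStream[String]): DStream[String] =
      events.union(moreEvents)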

Re: how to extract/combine elements of an Array in DStream element?

2014-10-29 Thread Holden Karau
On Wed, Oct 29, 2014 at 3:29 PM, spr s...@yarcdata.com wrote: I am processing a log file, from each line of which I want to extract the zeroth and 4th elements (and an integer 1 for counting) into a tuple. I had hoped to be able to index the Array for elements 0 and 4, but Arrays appear not

Re: what does DStream.union() do?

2014-10-29 Thread Holden Karau
of the same templated type. If you have heterogeneous data you can first map each DStream to a case class with options or try something like http://stackoverflow.com/questions/3508077/does-scala-have-type-disjunction-union-types Holden Karau wrote The union function simply returns a DStream

Re: Nesting RDD

2014-11-06 Thread Holden Karau
Hi Naveen, Nesting RDDs inside of transformations or actions is not supported. Instead, if you need access to the other RDD's contents you can try doing a join or (if the data is small enough) collecting and broadcasting the second RDD. Cheers, Holden :) On Thu, Nov 6, 2014 at 10:28 PM, Naveen
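
A minimal sketch of the two alternatives, using hypothetical RDDs (events keyed by user id and a small users lookup):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Option 1: a join works regardless of the size of the lookup RDD.
    def enrichWithJoin(events: RDD[(Int, String)],
                       users: RDD[(Int, String)]): RDD[(Int, (String, String))] =
      events.join(users)

    // Option 2: if the lookup RDD is small, collect it and broadcast it as a plain Map.
    def enrichWithBroadcast(sc: SparkContext,
                            events: RDD[(Int, String)],
                            users: RDD[(Int, String)]): RDD[(Int, (String, String))] = {
      val userMap = sc.broadcast(users.collectAsMap())
      events.flatMap { case (id, event) =>
        userMap.value.get(id).map(name => (id, (event, name)))
      }
    }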

Re: Spark Streaming testing strategies

2015-03-10 Thread Holden Karau
place for this to live since the internal Spark testing code is changing/evolving rapidly, but I think once we have the trait fleshed out a bit more we could see if there is enough interest to try and merge it in (just my personal thoughts). On 1 March 2015 at 18:49, Holden Karau hol

Re: Spark Streaming testing strategies

2015-03-01 Thread Holden Karau
There is also the Spark Testing Base package which is on spark-packages.org and hides the ugly bits (it's based on the existing streaming test code but I cleaned it up a bit to try and limit the number of internals it was touching). On Sunday, March 1, 2015, Marcin Kuthan marcin.kut...@gmail.com

Re: IOUtils cannot write anything in Spark?

2015-04-23 Thread Holden Karau
It seems like saveAsTextFile might do what you are looking for. On Wednesday, April 22, 2015, Xi Shen davidshe...@gmail.com wrote: Hi, I have a RDD of some processed data. I want to write these files to HDFS, but not for future M/R processing. I want to write plain old style text file. I

Re: swap tuple

2015-05-14 Thread Holden Karau
Can you paste your code? Transformations return a new RDD rather than modifying an existing one, so if you were to swap the values of the tuple using a map you would get back a new RDD, and then you would want to try and print this new RDD instead of the original one. On Thursday, May 14, 2015,
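
A minimal sketch of the swap, assuming a hypothetical RDD[(String, Int)]; the result is a new RDD, so that is the one to print:

    import org.apache.spark.rdd.RDD

    def swap(pairs: RDD[(String, Int)]): RDD[(Int, String)] =
      pairs.map { case (k, v) => (v, k) }   // returns a new RDD; the input is unchanged

    // usage sketch: swap(pairs).take(10).foreach(println)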

Re: Help needed with Py4J

2015-05-20 Thread Holden Karau
Santosh On May 20, 2015, at 7:26 PM, Holden Karau hol...@pigscanfly.ca wrote: Are your jars included in both the driver and worker class paths? On Wednesday, May 20, 2015, Addanki, Santosh Kumar santosh.kumar.adda...@sap.com wrote: Hi Colleagues We need to call a Scala Class from

Re: Help needed with Py4J

2015-05-20 Thread Holden Karau
Are your jars included in both the driver and worker class paths? On Wednesday, May 20, 2015, Addanki, Santosh Kumar santosh.kumar.adda...@sap.com wrote: Hi Colleagues We need to call a Scala Class from pySpark in Ipython notebook. We tried something like below : from

Re: Compute Median in Spark Dataframe

2015-06-04 Thread Holden Karau
this but it does terrible things to access Spark internals. I also need to call a Hive UDAF in a dataframe agg function. Are there any examples of what Column expects? Deenar On 2 June 2015 at 21:13, Holden Karau hol...@pigscanfly.ca wrote: So for column you need to pass in a Java function, I have some

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) Any idea? Olivier. On Tue, 2 Jun 2015 at 18:02, Holden Karau hol...@pigscanfly.ca wrote: Not super easily; the GroupedData class uses a strToExpr function which has a pretty limited set of functions

Re: Standard Scaler taking 1.5hrs

2015-06-04 Thread Holden Karau
take(5) will only evaluate enough partitions to provide 5 elements (sometimes a few more but you get the idea), so it won't trigger a full evaluation of all partitions unlike count(). On Thursday, June 4, 2015, Piero Cinquegrana pcinquegr...@marketshare.com wrote: Hi DB, Yes I am running

Re: map V mapPartitions

2015-06-23 Thread Holden Karau
I think one of the primary cases where mapPartitions is useful is when you are going to be doing setup work that can be re-used between processing each element; this way the setup work only needs to be done once per partition (for example creating an instance of a Joda-Time formatter). Both map and
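
A minimal sketch of the per-partition setup pattern; SimpleDateFormat here stands in for any expensive or non-serializable setup (a Joda-Time formatter, a client object, etc.):

    import java.text.SimpleDateFormat
    import org.apache.spark.rdd.RDD

    def parseTimestamps(lines: RDD[String]): RDD[java.util.Date] =
      lines.mapPartitions { iter =>
        val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss") // built once per partition
        iter.map(fmt.parse)                                   // reused for every element
      }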

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
Not super easily; the GroupedData class uses a strToExpr function which has a pretty limited set of functions, so we can't pass in the name of an arbitrary Hive UDAF (unless I'm missing something). We can instead construct a column with the expression you want and then pass it in to agg() that way
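
A hedged sketch of that approach: build a Column from a SQL expression calling the Hive percentile_approx UDAF and pass it to agg(). It assumes a Hive-enabled context, a Spark version where functions.expr is available, and hypothetical column names "group" and "value":

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.expr

    def approxMedian(df: DataFrame): DataFrame =
      df.groupBy("group").agg(expr("percentile_approx(value, 0.5) AS approx_median"))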

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread Holden Karau
Collecting it as a regular (Java/Scala/Python) map. You can also broadcast the map if you're going to use it multiple times. On Wednesday, July 1, 2015, Ashish Soni asoni.le...@gmail.com wrote: Thanks , So if i load some static data from database and then i need to use than in my map function to

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Holden Karau
So DataFrames, like RDDs, can only be accessed from the driver. If your IP frequency table is small enough you could collect it and distribute it as a hashmap with broadcast, or you could also join your RDD with the IP frequency table. Hope that helps :) On Thursday, May 21, 2015, ping yan

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Holden Karau
I just pushed some code that does this for spark-testing-base ( https://github.com/holdenk/spark-testing-base ) (its in master) and will publish an updated artifact with it for tonight. On Fri, Aug 14, 2015 at 3:35 PM, Tathagata Das t...@databricks.com wrote: A hacky workaround is to create a

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-10-21 Thread Holden Karau
> Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl > Follow me at https://twitter.com/jaceklaskowski > Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski > > > On Wed, Oct 21, 2015 at 12:55 AM, Holden Karau <hol...@pigscanfly.ca> > wro

Re: Spark with business rules

2015-10-26 Thread Holden Karau
Spark SQL seems like it might be the best interface if your users are already familiar with SQL. On Mon, Oct 26, 2015 at 3:12 PM, danilo wrote: > Hi All, I want to create a monitoring tool using my sensor data. I receive > the events every seconds and I need to create a

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-10-27 Thread Holden Karau
flow.com/users/1305344/jacek-laskowski > > > On Wed, Oct 21, 2015 at 12:55 AM, Holden Karau <hol...@pigscanfly.ca> > wrote: > > Hi SF based folks, > > > > I'm going to try doing some simple office hours this Friday afternoon > > outside of Paramo Coffee. If n

Re: Spark-Testing-Base Q/A

2015-10-28 Thread Holden Karau
And now (before 1am California time :p) there is a new version of spark-testing-base which adds a Java base class for streaming tests (I noticed you were using 1.3, so I put in the effort to make this release for Spark 1.3 to 1.5 inclusive). On Wed, Oct 21, 2015 at 4:16 PM, Holden Karau <

Re: Spark-Testing-Base Q/A

2015-10-21 Thread Holden Karau
rs are willing to use Scala or Python. > Sounds reasonable, I'll add it this week. > > If i am not wrong it’s 4:00 AM for you in California ;) > > Yup, I'm not great a regular schedules but I make up for it by doing stuff when I've had too much coffee to sleep :p > Regards, >

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-11-10 Thread Holden Karau
, 2015 at 11:43 AM, Holden Karau <hol...@pigscanfly.ca> wrote: > So I'm going to try and do these again, with an on-line ( > http://doodle.com/poll/cr9vekenwims4sna ) and SF version ( > http://doodle.com/poll/ynhputd974d9cv5y ). You can help me pick a day > that works for

SF Spark Office Hours Experiment - Friday Afternoon

2015-10-20 Thread Holden Karau
Hi SF based folks, I'm going to try doing some simple office hours this Friday afternoon outside of Paramo Coffee. If no one comes by I'll just be drinking coffee and hacking on some Spark PRs, so if you just want to hang out and hack on Spark as a group come by too. (See

Re: Unit tests of spark application

2015-07-10 Thread Holden Karau
Somewhat biased of course, but you can also use spark-testing-base from spark-packages.org as a basis for your unit tests. On Fri, Jul 10, 2015 at 12:03 PM, Daniel Siegmann daniel.siegm...@teamaol.com wrote: On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire vmadh...@umail.iu.edu wrote: I want

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Holden Karau
Yes, any Java serializable object. It's important to note that since it's saving serialized objects it is as brittle as Java serialization when it comes to version changes, so if you can make your data fit in something like sequence files, Parquet, Avro, or similar it can be not only more space

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Holden Karau
So println of any array of strings will look like that. The java.util.Arrays class has some options to print arrays nicely. On Thu, Aug 27, 2015 at 2:08 PM, Arun Luthra arun.lut...@gmail.com wrote: What types of RDD can saveAsObjectFile(path) handle? I tried a naive test with an
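
A minimal sketch of the difference (REPL-style; the hash in the first output line will vary, and the toArray[AnyRef] is needed because Arrays.toString expects an Object[]):

    val words = Array("spark", "scala", "rdd")
    println(words)                                             // e.g. [Ljava.lang.String;@1b6d3586
    println(words.mkString(", "))                              // spark, scala, rdd
    println(java.util.Arrays.toString(words.toArray[AnyRef]))  // [spark, scala, rdd]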

Re: tweet transformation ideas

2015-08-27 Thread Holden Karau
It seems like this might be better suited to a broadcasted hash map since 200k entries isn't that big. You can then map over the tweets and look up each word in the broadcasted map. On Thursday, August 27, 2015, Jesse F Chen jfc...@us.ibm.com wrote: This is a question on general usage/best

Help with collect() in Spark Streaming

2015-09-11 Thread Holden Karau
A common practice to do this is to use foreachRDD with a local var to accumulate the data (you can see it in the Spark Streaming test code). That being said, I am a little curious why you want the driver to create the file specifically. On Friday, September 11, 2015, allonsy
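
A minimal sketch of that driver-side pattern, assuming a hypothetical DStream of lines; the buffer fills as batches arrive once the streaming context is started:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.streaming.dstream.DStream

    def collectToDriver(lines: DStream[String]): ArrayBuffer[String] = {
      val collected = new ArrayBuffer[String]()
      lines.foreachRDD { rdd =>
        collected ++= rdd.collect()   // runs on the driver after each batch
      }
      collected
    }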

Re: Help with collect() in Spark Streaming

2015-09-11 Thread Holden Karau
he need of repartitioning data? > > Hope I have been clear, I am pretty new to Spark. :) > > 2015-09-11 18:19 GMT+02:00 Holden Karau <hol...@pigscanfly.ca>: > >> A common practice to do this is to use fore

Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I'd do some searching and see if there is a JIRA related to this problem on s3 and if you don't find one go ahead and make one. Even if it is an intrinsic problem with s3 (and I'm not super sure since I'm just reading this on mobile) - it would maybe be a good thing for us to document. On

Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I think your error could possibly be different - looking at the original JIRA the issue was happening on HDFS and you seem to be experiencing the issue on s3n, and while I don't have full view of the problem I could see this being s3 specific (read-after-write on s3 is trickier than

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Holden Karau
Hi Ram, Not super certain what you are looking to do. Are you looking to add a new algorithm to Spark MLlib for streaming or use Spark MLlib on streaming data? Cheers, Holden On Friday, June 10, 2016, Ram Krishna wrote: > Hi All, > > I am new to this field, I

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Holden Karau
So that's a bit complicated - you might want to start with reading the code for the existing algorithms and go from there. If your goal is to contribute the algorithm to Spark you should probably take a look at the JIRA as well as the contributing to Spark guide on the wiki. Also we have a

Re: --driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-02 Thread Holden Karau
Also seems like this might be better suited for dev@ On Thursday, June 2, 2016, Sun Rui wrote: > yes, I think you can fire a JIRA issue for this. > But why removing the default value. Seems the default core is 1 according > to >

Re: ImportError: No module named numpy

2016-06-01 Thread Holden Karau
Generally this means numpy isn't installed on the system or your PYTHONPATH has somehow gotten pointed somewhere odd. On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra wrote: > If any one please can help me with following error. > > File >

Re: Creating a python port for a Scala Spark Projeect

2016-06-22 Thread Holden Karau
PySpark RDDs are (on the Java side) essentially RDDs of pickled objects and mostly (but not entirely) opaque to the JVM. It is possible (by using some internals) to pass a PySpark DataFrame to a Scala library (you may or may not find the talk I gave at Spark Summit useful

Re: Unit test with sqlContext

2016-02-04 Thread Holden Karau
Thanks for recommending spark-testing-base :) Just wanted to add if anyone has feature requests for Spark testing please get in touch (or add an issue on the github) :) On Thu, Feb 4, 2016 at 8:25 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Hi Steve, > > Have you looked at the

Re: local class incompatible: stream classdesc serialVersionUID

2016-02-01 Thread Holden Karau
So I'm a little confused as to exactly how this might have happened - but one quick guess is that maybe you've built an assembly jar with Spark core; can you mark it as provided and/or post your build file? On Fri, Jan 29, 2016 at 7:35 AM, Ted Yu wrote: > I logged

Re: local class incompatible: stream classdesc serialVersionUID

2016-02-01 Thread Holden Karau
1.5.1 executors. > > On Mon, Feb 1, 2016 at 2:08 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > >> So I'm a little confused to exactly how this might have happened - but >> one quick guess is that maybe you've built an assembly jar with Spark core, >> can you m

Re: Using accumulator to push custom logs to driver

2016-02-01 Thread Holden Karau
I wouldn't use accumulators for things which could get large, they can become kind of a bottleneck. Do you have a lot of string messages you want to bring back or only a few? On Mon, Feb 1, 2016 at 3:24 PM, Utkarsh Sengar wrote: > I am trying to debug code executed in

Re: Using accumulator to push custom logs to driver

2016-02-01 Thread Holden Karau
ly add debug statements which returns > info about the dataset etc. > I would assume the strings will vary from 100-200lines max, that would be > about 50-100KB if they are really long lines. > > -Utkarsh > > On Mon, Feb 1, 2016 at 3:40 PM, Holden Karau <hol...@pigscanfly.ca>

Re: Getting Co-oefficients of a logistic regression model for a pipelinemodel Spark ML library

2016-01-21 Thread Holden Karau
Hi Vinayaka, You can access the different stages in your pipeline through the stages array on your pipeline model ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.PipelineModel ) and then cast it to the correct stage type (if working in Scala, or if in Python just access
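
A hedged sketch, assuming the logistic regression is the last stage of the fitted pipeline and a Spark version where the ml LogisticRegressionModel exposes coefficients:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.classification.LogisticRegressionModel

    def lrCoefficients(model: PipelineModel) = {
      val lr = model.stages.last.asInstanceOf[LogisticRegressionModel]
      (lr.coefficients, lr.intercept)   // the fitted weights and intercept
    }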

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
egistrator","--.MyRegistrator") > .set("spark.kryo.registrationRequired", "true") > .set("spark.yarn.executor.memoryOverhead","600") > > On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen <joshro...@databricks.com>

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
Can you post more of your log? How big are the partitions? What is the action you are performing? On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra wrote: > Example warning: > > 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID > 4436, XXX):

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
; --conf "spark.executor.extraJavaOptions=-verbose:gc > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \ > > my.jar > > > There are 2262 input files totaling just 98.6G. The DAG is basically > textFile().map().filter().groupByKey().saveAsTextFile(). > >

Re: install databricks csv package for spark

2016-02-19 Thread Holden Karau
So with --packages to spark-shell and spark-submit Spark will automatically fetch the requirements from maven. If you want to use an explicit local jar you can do that with the --jars syntax. You might find http://spark.apache.org/docs/latest/submitting-applications.html useful. On Fri, Feb 19,

Re: Spark stream job is take up /TMP with 100%

2016-02-19 Thread Holden Karau
That's a good question, you can find most of what you are looking for in the configuration guide at http://spark.apache.org/docs/latest/configuration.html - you probably want to change the spark.local.dir setting to point to your scratch directory. Out of interest, what problems have you been seeing with

Re: Submitting Jobs Programmatically

2016-02-19 Thread Holden Karau
How are you trying to launch your application? Do you have the Spark jars on your class path? On Friday, February 19, 2016, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Hello, > > I am trying to submit a spark job via a program. > > When I run it, I receive the following error:

Re: Partitioning to speed up processing?

2016-03-10 Thread Holden Karau
Are they entire data set aggregates or is there some grouping applied? On Thursday, March 10, 2016, Gerhard Fiedler wrote: > I have a number of queries that result in a sequence Filter > Project > > Aggregate. I wonder whether partitioning the input table makes

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Holden Karau
So the run tests command allows you to specify the Python version to test against - maybe specify python2.7. On Friday, March 11, 2016, Gayathri Murali wrote: > I do have 2.7 installed and unittest2 package available. I still see this > error : > > Please install

Re: About nested RDD

2016-04-08 Thread Holden Karau
It seems like the union function on RDDs might be what you are looking for, or was there something else you were trying to achieve? On Thursday, April 7, 2016, Tenghuan He wrote: > Hi all, > > I know that nested RDDs are not possible like linke rdd1.map(x => x + >

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Holden Karau
I'm very much in favor of this, the less porting work there is the better :) On Tue, Apr 5, 2016 at 5:32 PM, Joseph Bradley wrote: > +1 By the way, the JIRA for tracking (Scala) API parity is: > https://issues.apache.org/jira/browse/SPARK-4591 > > On Tue, Apr 5, 2016 at

Re: Reading Back a Cached RDD

2016-03-24 Thread Holden Karau
Even checkpoint() is maybe not exactly what you want, since if reference tracking is turned on it will get cleaned up once the original RDD is out of scope and GC is triggered. If you want to share persisted RDDs right now one way to do this is sharing the same spark context (using something like

Re: python support of mapWithState

2016-03-24 Thread Holden Karau
In general the Python API lags behind the Scala & Java APIs. The Scala & Java APIs tend to be easier to keep in sync since they are both in the JVM and a bit more work is needed to expose the same functionality from the JVM in Python (or re-implement the Scala code in Python where appropriate).

Re: Spark reduce serialization question

2016-03-06 Thread Holden Karau
You might want to try treeAggregate. On Sunday, March 6, 2016, Takeshi Yamamuro wrote: > Hi, > > I'm not exactly sure what's your codes like though, ISTM this is a correct > behaviour. > If the size of data that a driver fetches exceeds the limit, the driver > throws this
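
A minimal sketch of treeAggregate on a hypothetical RDD of doubles; partial results are merged in a tree (depth 2 here), so the driver receives far less data than with a plain aggregate or reduce:

    import org.apache.spark.rdd.RDD

    def sumOfSquares(values: RDD[Double]): Double =
      values.treeAggregate(0.0)(
        seqOp = (acc, v) => acc + v * v,  // within each partition
        combOp = _ + _,                   // merged up the tree, not all at the driver
        depth = 2
      )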

Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Holden Karau
So doing a quick look through the README & code for spark-sftp it seems that the way this connector works is by downloading the file locally on the driver program and this is not configurable - so you would probably need to find a different connector (and you probably shouldn't use spark-sftp for

Re: Saving Spark generated table into underlying Hive table using Functional programming

2016-03-07 Thread Holden Karau
So what about if you just start with a HiveContext and create your DF using the HiveContext? On Monday, March 7, 2016, Mich Talebzadeh wrote: > Hi, > > I have done this in Spark-shell and Hive itself so it works. > > I am exploring whether I can do it programmatically.

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Holden Karau
You can also look at spark-testing-base which works in both Scalatest and Junit and see if that works for your use case. On Friday, April 1, 2016, Ted Yu wrote: > Assuming your code is written in Scala, I would suggest using ScalaTest. > > Please take a look at the

Re: since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread Holden Karau
You probably want to look at the map transformation, and the many more defined on RDDs. The function you pass in to map is serialized and the computation is distributed. On Monday, March 28, 2016, charles li wrote: > > use case: have a dataset, and want to use different

Re: Confused - returning RDDs from functions

2016-05-12 Thread Holden Karau
This is not the expected behavior, can you maybe post the code where you are running into this? On Thursday, May 12, 2016, Dood@ODDO wrote: > Hello all, > > I have been programming for years but this has me baffled. > > I have an RDD[(String,Int)] that I return from a

Re: JSON Usage

2016-04-14 Thread Holden Karau
You could certainly use RDDs for that; you might also find it easier to use a Dataset, selecting the fields you need to construct the URL to fetch and then using the map function. On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim wrote: > I was wondering what would be the best way

Re: error "Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe."

2016-04-14 Thread Holden Karau
The org.apache.spark.sql.execution.EvaluatePython.takeAndServe exception can happen in a lot of places it might be easier to figure out if you have a code snippet you can share where this is occurring? On Wed, Apr 13, 2016 at 2:27 PM, AlexModestov wrote: > I get

Re: how to write pyspark interface to scala code?

2016-04-14 Thread Holden Karau
It's a bit tricky - if the user's data is represented in a DataFrame or Dataset then it's much easier. Assuming that the function is going to be called from the driver program (e.g. not inside of a transformation or action) then you can use the Py4J context to make the calls. You might find looking

Re: Calling Python code from Scala

2016-04-18 Thread Holden Karau
So if there are just a few Python functions you're interested in accessing you can also use the pipe interface (you'll have to manually serialize your data on both ends in ways that Python and Scala can respectively parse) - but it's a very generic approach and can work with many different languages.
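
A hedged sketch of the pipe approach; my_udf.py is a hypothetical script, and the line-based (de)serialization on both ends is up to you (tab-separated values, JSON lines, etc.):

    import org.apache.spark.rdd.RDD

    // Each element is written to the external process's stdin as a line of text,
    // and each line the process writes to stdout becomes an element of the result.
    def scoreWithPython(records: RDD[String]): RDD[String] =
      records.pipe("python my_udf.py")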

Re: Spark Beginner Question

2016-07-26 Thread Holden Karau
So you will need to convert your input DataFrame into something with vectors and labels to train on - the Spark ML documentation has examples http://spark.apache.org/docs/latest/ml-guide.html (although the website seems to be having some issues mid update to Spark 2.0 so if you want to read it

Re: Spark2 SBT Assembly

2016-08-10 Thread Holden Karau
What are you looking to use the assembly jar for - maybe we can think of a workaround :) On Wednesday, August 10, 2016, Efe Selcuk wrote: > Sorry, I should have specified that I'm specifically looking for that fat > assembly behavior. Is it no longer possible? > > On Wed,

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-10 Thread Holden Karau
Hi Luis, You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you can do groupBy followed by a reduce on the GroupedDataset ( http://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.GroupedDataset ) - this works on a per-key basis despite the different name.
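
A hedged sketch against the Spark 1.6 Dataset API, on hypothetical word-count shaped data; groupBy with a key function returns a GroupedDataset, and reduce then combines the grouped elements per key, much like RDD.reduceByKey:

    import org.apache.spark.sql.{Dataset, SQLContext}

    def countWords(sqlContext: SQLContext,
                   pairs: Dataset[(String, Int)]): Dataset[(String, (String, Int))] = {
      import sqlContext.implicits._   // encoder for the String key
      pairs.groupBy(_._1).reduce((a, b) => (a._1, a._2 + b._2))
    }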

Re: groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Holden Karau
So it looks like (despite the name) pair_rdd is actually a Dataset - my guess is you might have a map on a dataset up above which used to return an RDD but now returns another dataset or an unexpected implicit conversion. Just add rdd() before the groupByKey call to push it into an RDD. That being

Re: pyspark 1.5 0 save model ?

2016-07-18 Thread Holden Karau
If you used RandomForestClassifier from mllib you can use the save method described in http://spark.apache.org/docs/1.5.0/api/python/pyspark.mllib.html#module-pyspark.mllib.classification which will write out some JSON metadata as well as parquet for the actual model. For the newer ml pipeline one

Re: Spark 7736

2016-07-19 Thread Holden Karau
Indeed there is: sign up for an Apache JIRA account, and then when you visit the JIRA page logged in you should see a "reopen issue" button. For issues like this (reopening a JIRA) - you might find the dev list to be more useful. On Wed, Jul 13, 2016 at 4:47 AM, ayan guha

Re: which one spark ml or spark mllib

2016-07-19 Thread Holden Karau
So Spark ML is going to be the actively developed Machine Learning library going forward; however, back in Spark 1.5 it was still relatively new and an experimental component, so not all of the save/load support was implemented for the same models. That being said, for 2.0 ML doesn't have PMML export

Re: spark single PROCESS_LOCAL task

2016-07-19 Thread Holden Karau
So it's possible that you have a lot of data in one of the partitions which is local to that process; maybe you could cache & count the upstream RDD and see what the input partitions look like? On the other hand - using groupByKey is often a bad sign to begin with - can you rewrite your code to
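
A minimal sketch of the usual rewrite on a hypothetical pair RDD: reduceByKey combines values on the map side before the shuffle instead of shipping every value for a key to a single task:

    import org.apache.spark.rdd.RDD

    def totalsWithGroupByKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
      pairs.groupByKey().mapValues(_.sum)   // every value for a key lands in one task

    def totalsWithReduceByKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
      pairs.reduceByKey(_ + _)              // partial sums are combined before the shuffle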

Re: Should it be safe to embed Spark in Local Mode?

2016-07-19 Thread Holden Karau
That's interesting and might be better suited to the dev list. I know in some cases a System.exit with -1 was added so the task would be marked as a failure. On Tuesday, July 19, 2016, Brett Randall wrote: > This question is regarding >

Working of Streaming Kmeans

2016-07-05 Thread Holden Karau
Hi Biplob, The current Streaming KMeans code only updates the model with data which comes in through training (e.g. trainOn); predictOn does not update the model. Cheers, Holden :) P.S. Traffic on the list might have been a bit slower right now because of Canada Day and the 4th of July weekend respectively.

Re: Bootstrap Action to Install Spark 2.0 on EMR?

2016-07-05 Thread Holden Karau
Just to be clear, Spark 2.0 isn't released yet; there is a preview version for developers to explore and test compatibility with. That being said, Roy Hasson has a blog post discussing using Spark 2.0-preview with EMR -

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Holden Karau
requires classtag > > sparkSession.sparkContext().broadcast > > On Mon, Aug 8, 2016 at 12:09 PM, Holden Karau <hol...@pigscanfly.ca> > wrote: > >> Classtag is Scala concept (see >> http://docs.scala-lang.org/overviews/reflection/typetags-manifests.html) - altho

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Holden Karau
ClassTag is a Scala concept (see http://docs.scala-lang.org/overviews/reflection/typetags-manifests.html) - although this should not be explicitly required - looking at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext we can see that in Scala the classtag is

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Holden Karau
That's a good point - there is an open issue for spark-testing-base to support this shared SparkSession approach - but I haven't had the time ( https://github.com/holdenk/spark-testing-base/issues/123 ). I'll try and include this in the next release :) On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers

Re: Call Scala API from PySpark

2016-06-30 Thread Holden Karau
So I'm a little biased - I think the best bridge between the two is using DataFrames. I've got some examples in my talk and on the high performance spark GitHub https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/high_performance_pyspark/simple_perf_test.py calls

Re: ML version of Kmeans

2017-01-31 Thread Holden Karau
You most likely want the transform function on KMeansModel (although that works on a dataset input rather than a single element at a time). On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I am not able to find predict method on "ML" version of
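
A minimal sketch with the ML KMeansModel, assuming a DataFrame with the default "features" vector column; transform appends a "prediction" column for the whole dataset at once:

    import org.apache.spark.ml.clustering.KMeansModel
    import org.apache.spark.sql.DataFrame

    def assignClusters(model: KMeansModel, data: DataFrame): DataFrame =
      model.transform(data).select("features", "prediction")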
