Re: Making Unpersist Lazy

2015-07-02 Thread Akhil Das
RDDs which are no longer required will be removed from memory by Spark itself (which you can consider as lazy?). Thanks Best Regards On Wed, Jul 1, 2015 at 7:48 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, The current behavior of rdd.unpersist() appears to not be lazily executed and

Re: Performance tuning in Spark SQL.

2015-07-02 Thread prosp4300
Please see the link below for the available options: https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#performance-tuning For example, reducing spark.sql.shuffle.partitions from 200 to 10 could improve performance significantly.
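A minimal sketch of applying that setting programmatically, assuming a SQLContext named sqlContext is already in scope (10 is only an illustrative value):

    // Lower the number of shuffle partitions Spark SQL uses for joins/aggregations.
    // The default is 200; a much smaller value can help on small data sets.
    sqlContext.setConf("spark.sql.shuffle.partitions", "10")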

Re: Illegal access error when initializing SparkConf

2015-07-02 Thread Ramprakash Ramamoorthy
Team, Got this fixed. After much juggling, when I replaced JRE7 with JRE8, things started working as intended. Cheers! On Wed, Jul 1, 2015 at 7:07 PM, Ramprakash Ramamoorthy youngestachie...@gmail.com wrote: Team, I'm just playing around with Spark and MLlib. Installed Scala and

Re: Convert CSV lines to List of Objects

2015-07-02 Thread Akhil Das
Have a look at sc.wholeTextFiles; you can use it to read the whole CSV contents into the value, then split it on \n and build up a list from the lines. *sc.wholeTextFiles:* Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported
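A rough sketch of the suggestion, assuming each file is a small CSV whose lines should end up in a list (the path is a placeholder):

    // wholeTextFiles returns an RDD of (path, fileContents) pairs.
    val files = sc.wholeTextFiles("hdfs:///data/csv-dir")

    // Split each file's contents on newlines and build a list of lines per file.
    val linesPerFile = files.map { case (path, contents) =>
      (path, contents.split("\n").toList)
    }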

Re: DataFrame Find/Filter Based on Input - Inside Map function

2015-07-02 Thread ayan guha
You can keep a joined dataset cached and filter that joined df with your filter condition. On 2 Jul 2015 15:01, Mailing List asoni.le...@gmail.com wrote: I need to pass the value of the filter dynamically, like where id=someVal, and that someVal exists in another RDD. How can I do this across
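One possible reading of that suggestion, sketched with hypothetical data frames (events, lookup) and a hypothetical join column id:

    // Join once against the lookup data and cache the result,
    // then apply the dynamic value as a filter on the cached, joined frame.
    val joined = events.join(lookup, "id").cache()

    def recordsFor(someVal: Long) = joined.filter(joined("id") === someVal)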

Re: Making Unpersist Lazy

2015-07-02 Thread Jem Tucker
Hi, After running some tests it appears that unpersist is called as soon as it is reached, so any tasks using this RDD later on will have to recalculate it. This is fine for simple programs, but when an RDD is created within a function and its reference is then lost but children of it continue to

Re: getting WARN ReliableDeliverySupervisor

2015-07-02 Thread xiaohe lan
Changing the JDK from 1.8.0_45 to 1.7.0_79 solved this issue. I saw https://issues.apache.org/jira/browse/SPARK-6388, but it is not a problem, however. On Thu, Jul 2, 2015 at 1:30 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi Expert, Hadoop version: 2.4 Spark version: 1.3.1 I am running the

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-02 Thread Suraj Shetiya
Hi Michael, Thanks for the quick response. This sounds like something that would work. However, rethinking the problem statement and various other use cases, which are growing, there are more such scenarios where one could have columns with structured and unstructured data embedded (JSON or XML

Re: .NET on Apache Spark?

2015-07-02 Thread Daniel Darabos
Indeed Spark does not have .NET bindings. On Thu, Jul 2, 2015 at 10:33 AM, Zwits daniel.van...@ortec-finance.com wrote: I'm currently looking into a way to run a program/code (DAG) written in .NET on a cluster using Spark. However I ran into problems concerning the coding language, Spark has

Re: Meets class not found error in spark console with newly hive context

2015-07-02 Thread Terry Hole
Found that this is a bug in Spark 1.4.0: SPARK-8368 https://issues.apache.org/jira/browse/SPARK-8368 Thanks! Terry On Thu, Jul 2, 2015 at 1:20 PM, Terry Hole hujie.ea...@gmail.com wrote: All, I am using the Spark console 1.4.0 to do some tests; when I create a new HiveContext (line 18 in the code) in

insert overwrite table phonesall in spark-sql resulted in java.io.StreamCorruptedException

2015-07-02 Thread John Jay
My spark-sql command: spark-sql --driver-memory 2g --master spark://hadoop04.xx.xx.com:8241 --conf spark.driver.cores=20 --conf spark.cores.max=20 --conf spark.executor.memory=2g --conf spark.driver.memory=2g --conf spark.akka.frameSize=500 --conf spark.eventLog.enabled=true --conf

Re: Spark driver hangs on start of job

2015-07-02 Thread Sjoerd Mulder
Hi Richard, I have actually applied the following fix to our 1.4.0 version and this seems to resolve the zombies :) https://github.com/apache/spark/pull/7077/files Sjoerd 2015-06-26 20:08 GMT+02:00 Richard Marscher rmarsc...@localytics.com: Hi, we are on 1.3.1 right now so in case there are

All masters are unresponsive issue

2015-07-02 Thread luohui20001
Hi there: I got a problem that the application has been killed. Reason: All masters are unresponsive! Giving up. I checked the network I/O and found that it is sometimes really high when running my app. Please refer to the attached picture for more info. I also checked

RE: .NET on Apache Spark?

2015-07-02 Thread Silvio Fiorito
Since Spark runs on the JVM, no, there isn't support for .NET. You should take a look at Dryad and Naiad instead. https://github.com/MicrosoftResearch/ From: Zwits mailto:daniel.van...@ortec-finance.com Sent: 7/2/2015 4:33 AM To:

.NET on Apache Spark?

2015-07-02 Thread Zwits
I'm currently looking into a way to run a program/code (DAG) written in .NET on a cluster using Spark. However, I ran into problems concerning the coding language: Spark has no .NET API. I tried looking into IronPython because Spark does have a Python API, but I couldn't find a way to use this. Is

Array fields in dataframe.write.jdbc

2015-07-02 Thread Anand Nalya
Hi, I'm using Spark 1.4. I have an array field in my data frame, and when I try to write this dataframe to Postgres I get the following exception: Exception in thread main java.lang.IllegalArgumentException: Can't translate null value for field

RE: Making Unpersist Lazy

2015-07-02 Thread Ganelin, Ilya
You may pass an optional parameter (blocking = false) to make it lazy. Thank you, Ilya Ganelin -Original Message- From: Jem Tucker [jem.tuc...@gmail.commailto:jem.tuc...@gmail.com] Sent: Thursday, July 02, 2015 04:06 AM Eastern Standard Time To: Akhil Das Cc: user Subject: Re: Making
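For reference, a minimal example of the non-blocking form mentioned above:

    // blocking = false returns immediately instead of waiting for
    // all cached blocks to be removed from the executors.
    rdd.unpersist(blocking = false)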

Spark SQL and Streaming - How to execute JDBC Query only once

2015-07-02 Thread Ashish Soni
Hi All, I have a stream of events coming in and I want to fetch some additional data from the database based on the values in the incoming data. For example, below is the data coming in: loginName, email address, city. Now for each login name I need to go to the Oracle database and get the userId from

override/update options in Dataframe/JdbcRdd

2015-07-02 Thread manohar
Hi, what are the options in the DataFrame/JdbcRdd save/saveAsTable API? Is there any option to override/update a particular column in the table, based on some ID column, instead of overwriting the whole table? SaveMode append is there, but it won't help us update the record; it will append/add a new row to

Starting Spark without automatically starting HiveContext

2015-07-02 Thread Daniel Haviv
Hi, I've downloaded the pre-built binaries for Hadoop 2.6, and whenever I start the spark-shell it always starts with a HiveContext. How can I disable the HiveContext from being initialized automatically? Thanks, Daniel

Re: making dataframe for different types using spark-csv

2015-07-02 Thread Kohler, Curt E (ELS-STL)
You should be able to do something like this (assuming an input file formatted as: String, IntVal, LongVal) import org.apache.spark.sql.types._ val recSchema = StructType(List(StructField(strVal, StringType, false), StructField(intVal, IntegerType,
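A completed sketch of the idea, assuming the spark-csv package is on the classpath and the file has no header row (field names and path are placeholders):

    import org.apache.spark.sql.types._

    // Schema for a file laid out as: String, Int, Long
    val recSchema = StructType(List(
      StructField("strVal", StringType, false),
      StructField("intVal", IntegerType, false),
      StructField("longVal", LongType, false)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .schema(recSchema)
      .load("hdfs:///data/records.csv")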

Re: BroadcastHashJoin when RDD is not cached

2015-07-02 Thread Srikanth
Good to know this will be in the next release. Thanks. On Wed, Jul 1, 2015 at 3:13 PM, Michael Armbrust mich...@databricks.com wrote: We don't know that the table is small unless you cache it. In Spark 1.5 you'll be able to give us a hint though (

Re: Meets class not found error in spark console with newly hive context

2015-07-02 Thread shenyan zhen
In case it helps: I got around it temporarily by saving and resetting the context class loader around creating the HiveContext. On Jul 2, 2015 4:36 AM, Terry Hole hujie.ea...@gmail.com wrote: Found that this is a bug in Spark 1.4.0: SPARK-8368 https://issues.apache.org/jira/browse/SPARK-8368 Thanks!

thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Daniel Haviv
Hi, I'm trying to start the thrift-server and passing it azure's blob storage jars but I'm failing on : Caused by: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Ted Yu
Which Spark release are you using ? bq. yarn--jars I guess the above was just a typo in your email (missing space). Cheers On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'm trying to start the thrift-server and passing it azure's blob storage jars

Re: Spark driver hangs on start of job

2015-07-02 Thread Richard Marscher
Ah I see, glad that simple patch works for your problem. That seems to be a different underlying problem than we have been experiencing. In our case, the executors are failing properly, it's just that none of the new ones will ever escape experiencing the same exact issue. So we start a death

Re: making dataframe for different types using spark-csv

2015-07-02 Thread Hafiz Mujadid
Thanks On Thu, Jul 2, 2015 at 5:40 PM, Kohler, Curt E (ELS-STL) c.koh...@elsevier.com wrote: You should be able to do something like this (assuming an input file formatted as: String, IntVal, LongVal) import org.apache.spark.sql.types._ val recSchema =

Dataframe in single partition after sorting?

2015-07-02 Thread Cesar Flores
I am sorting a data frame using something like: val sortedDF = df.orderBy(df("score").desc) The sorting is really fast. The issue I have is that after sorting, the resulting data frame sortedDF appears to be in a single partition, which is a problem because when I try to execute another operation
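If downstream parallelism matters more than keeping the rows in one globally sorted partition, one hedged workaround is to repartition after the sort (this shuffles the data, so the total ordering is no longer preserved across partitions):

    val sortedDF = df.orderBy(df("score").desc)

    // Redistribute the rows for parallel downstream work; 32 is an arbitrary example.
    val redistributed = sortedDF.repartition(32)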

binaryFiles() for 1 million files, too much memory required

2015-07-02 Thread Kostas Kougios
Once again I am trying to read a directory tree using binaryFiles. My directory tree has a root dir ROOTDIR and subdirs where the files are located, i.e. ROOTDIR/1 ROOTDIR/2 ROOTDIR/.. ROOTDIR/100. A total of 1 million files split into 100 subdirs. Using binaryFiles requires too much memory on the

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-02 Thread Salih Oztop
Hi Suraj, It seems your requirement is Record Linkage/Entity Resolution. https://en.wikipedia.org/wiki/Record_linkage http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf A presentation from Spark Summit using

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim, Seems you already set the PermGen size to 256m, right? I notice that in your shell you created a HiveContext (it further increased the memory consumption on PermGen). But the Spark shell has already created a HiveContext for you (sqlContext). You can use asInstanceOf to access

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Simeon Simeonov
Same error with the new code: import org.apache.spark.sql.hive.HiveContext val ctx = sqlContext.asInstanceOf[HiveContext] import ctx.implicits._ val df = ctx.jsonFile("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-0.gz") df.registerTempTable("training") val dfCount =

Spark Thriftserver exec insert sql got error on Hadoop federation

2015-07-02 Thread Xiaoyu Wang
Hi all! My SQL case is: insert overwrite table test1 select * from test; At the end of the job I got a move-file error. I see that hive-0.13.1 support for viewfs is not good until hive-1.1.0+. How can I upgrade the Hive version for Spark? Or how can I fix the bug in org.spark-project.hive? My version: Spark version

Re: Spark launching without all of the requested YARN resources

2015-07-02 Thread Arun Luthra
Thanks Sandy et al, I will try that. I like that I can choose the minRegisteredResourcesRatio. On Wed, Jun 24, 2015 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Arun, You can achieve this by setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really high number

duplicate names in sql allowed?

2015-07-02 Thread Koert Kuipers
I am surprised this is allowed... scala> sqlContext.sql("select name as boo, score as boo from candidates").schema res7: org.apache.spark.sql.types.StructType = StructType(StructField(boo,StringType,true), StructField(boo,IntegerType,true)) should StructType check for duplicate field names?

Aggregating the same column multiple times

2015-07-02 Thread sim
What is the rationale for not allowing the same column in a GroupedData to be aggregated more than once using agg, especially when the method signature def agg(aggExpr: (String, String), aggExprs: (String, String)*) allows passing something like agg("x" -> "sum", "x" -> "avg")?
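For what it's worth, the Column-based overload of agg does allow aggregating the same column more than once; a small sketch assuming hypothetical columns k and x:

    import org.apache.spark.sql.functions.{sum, avg}

    // Both aggregates are computed over the same column "x".
    val result = df.groupBy("k").agg(sum("x"), avg("x"))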

import pyspark.sql.Row gives error in 1.4.1

2015-07-02 Thread Krishna Sankar
Error - ImportError: No module named Row. Cheers, enjoy the long weekend. k/

Re: All masters are unresponsive issue

2015-07-02 Thread luohui20001
Hi there, I checked the source code and found that in org.apache.spark.deploy.client.AppClient there are these parameters (line 52): val REGISTRATION_TIMEOUT = 20.seconds val REGISTRATION_RETRIES = 3. As I understand it, if I want to increase the number of retries, must I modify this value and rebuild the

Spark MLlib 1.4.0 - logistic regression with SGD model accuracy is different in local mode and cluster mode

2015-07-02 Thread Nirmal Fernando
Hi All, I'm facing a quite strange case where, after migrating to Spark 1.4.0, I'm seeing Spark MLlib produce different results when run in local mode and in cluster mode. Is there any possibility of that happening? (I feel this is an issue in my environment, but just wanted to get it confirmed.) Thanks.

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-07-02 Thread Tobias Pfeiffer
Hi, On Thu, Jan 29, 2015 at 9:52 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Thu, Jan 29, 2015 at 1:54 AM, YaoPau jonrgr...@gmail.com wrote: My thinking is to maintain state in an RDD and update it an persist it with each 2-second pass, but this also seems like it could get messy.

Fwd: map vs foreach for sending data to external system

2015-07-02 Thread Alexandre Rodrigues
Hi Spark devs, I'm coding a Spark job and at a certain point in execution I need to send some data present in an RDD to an external system. val myRdd = myRdd.foreach { record => sendToWhtv(record) } The thing is that foreach forces materialization of the RDD and it seems to be executed

Re: map vs foreach for sending data to external system

2015-07-02 Thread Eugen Cepoi
*The thing is that foreach forces materialization of the RDD and it seems to be executed on the driver program* What makes you think that? No, foreach is run in the executors (distributed) and not in the driver. 2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues alex.jose.rodrig...@gmail.com: Hi

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Daniel Haviv
Hi, I'm using 1.4. It's indeed a typo in the email itself. Thanks, Daniel On Thu, Jul 2, 2015 at 6:06 PM, Ted Yu yuzhih...@gmail.com wrote: Which Spark release are you using ? bq. yarn--jars I guess the above was just a typo in your email (missing space). Cheers On Thu, Jul 2, 2015 at

Re: map vs foreach for sending data to external system

2015-07-02 Thread Silvio Fiorito
foreach absolutely runs on the executors. For sending data to an external system you should likely use foreachPartition in order to batch the output. Also if you want to limit the parallelism of the output action then you can use coalesce. What makes you think foreach is running on the driver?
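A sketch of the foreachPartition pattern being suggested, with a hypothetical client for the external system (createClient, send and close are placeholders for your own code):

    rdd.foreachPartition { records =>
      // One connection per partition instead of one per record.
      val client = createClient()
      try {
        records.foreach(record => client.send(record))
      } finally {
        client.close()
      }
    }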

Re: Grouping runs of elements in a RDD

2015-07-02 Thread Mohit Jaggi
If you are joining successive lines together based on a predicate, then you are doing a flatMap, not an aggregate. You are on the right track with a multi-pass solution. I had the same challenge when I needed a sliding window over an RDD (see below). [I had suggested that the sliding window API be

Re: Check for null in PySpark DataFrame

2015-07-02 Thread Pedro Rodriguez
Thanks for the tip. Any idea why the intuitive answer doesn't work (!= None)? I inspected the Row columns and they do indeed have a None value. I would suspect that somehow Python's None is translated to something in the JVM which isn't equal to null? I might check out the source code for a better

Re: map vs foreach for sending data to external system

2015-07-02 Thread Alexandre Rodrigues
Foreach is listed as an action[1]. I guess an *action* just means that it forces materialization of the RDD. I just noticed much faster executions with map although I don't like the map approach. I'll look at it with new eyes if foreach is the way to go. [1] –

Re: Spark SQL and Streaming - How to execute JDBC Query only once

2015-07-02 Thread Raghavendra Pandey
This will not work, i.e., using a data frame inside a map function. Although you can try to create the df separately and cache it... Then you can join your event stream with this df. On Jul 2, 2015 6:11 PM, Ashish Soni asoni.le...@gmail.com wrote: Hi All, I have a stream of events coming in and I want

Re: KMeans questions

2015-07-02 Thread Feynman Liang
SPARK-7879 https://issues.apache.org/jira/browse/SPARK-7879 seems to address your use case (running KMeans on a dataframe and having the results added as an additional column) On Wed, Jul 1, 2015 at 5:53 PM, Eric Friedman eric.d.fried...@gmail.com wrote: In preparing a DataFrame (spark 1.4) to

wholeTextFiles(/x/*/*.txt) runs single threaded

2015-07-02 Thread Kostas Kougios
Hi, I have a cluster of 4 machines and I call sc.wholeTextFiles("/x/*/*.txt"). Folder x contains subfolders, and each subfolder contains thousands of files, with a total of ~1 million matching the path expression. My Spark task starts processing the files, but single-threaded. I can see that in the Spark UI,

Re: wholeTextFiles(/x/*/*.txt) runs single threaded

2015-07-02 Thread Kostas Kougios
In the Spark UI I can see it creating 2 stages. I tried wholeTextFiles().repartition(32), but got the same threading results. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591p23593.html Sent from the Apache Spark User List
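One more thing worth trying (a guess, not a confirmed fix): wholeTextFiles also accepts a minPartitions hint, which influences how the input is split up front rather than after the fact:

    // Ask for at least 32 input partitions when reading.
    val files = sc.wholeTextFiles("/x/*/*.txt", minPartitions = 32)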

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-02 Thread Raghavendra Pandey
You can collect the dataframe as an array and then create a map out of it. On Jul 2, 2015 9:23 AM, asoni.le...@gmail.com wrote: Any example of how I can return a HashMap from a data frame? Thanks, Ashish On Jul 1, 2015, at 11:34 PM, Holden Karau hol...@pigscanfly.ca wrote: Collecting it as a

Re: map vs foreach for sending data to external system

2015-07-02 Thread Eugen Cepoi
Heh, an action, or materialization, means that it will trigger the computation over the RDD. A transformation like map means that it will create the transformation chain that must be applied on the data, but it is actually not executed. It is executed only when an action is triggered over that
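A tiny illustration of that distinction, with parse and sendToExternalSystem standing in for the user's own functions:

    // Transformation: nothing runs yet, only the lineage is recorded.
    val parsed = sc.textFile("input.txt").map(line => parse(line))

    // Action: this triggers the computation, and the closure runs on the executors.
    parsed.foreach(record => sendToExternalSystem(record))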

sliding

2015-07-02 Thread tog
Hi, Sorry for this Scala/Spark newbie question. I am creating an RDD which represents a large time series this way: val data = sc.textFile("somefile.csv") case class Event( time: Double, x: Double, vztot: Double ) val events = data.filter(s => !s.startsWith("GMT")).map{ s =>

Re: map vs foreach for sending data to external system

2015-07-02 Thread Alexandre Rodrigues
What I'm doing in the RDD is parsing a text file and sending things to the external system. I guess that it does that immediately when the action (count) is triggered instead of being a two-step process. So I guess I should have the parsing logic + sending to the external system inside the foreach (with

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Benjamin Fradet
Hi, You can set those parameters through spark.executor.extraJavaOptions, which is documented in the configuration guide: spark.apache.org/docs/latest/configuration.html On 2 Jul 2015 9:06 pm, Mulugeta Mammo mulugeta.abe...@gmail.com wrote: Hi, I'm running Spark 1.4.0, I want to specify

1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread sim
A very simple Spark SQL COUNT operation succeeds in spark-shell for 1.3.1 and fails with a series of out-of-memory errors in 1.4.0. This gist https://gist.github.com/ssimeonov/a49b75dc086c3ac6f3c4 includes the code and the full output from the 1.3.1 and 1.4.0 runs, including the command line

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
Yes, that does appear to be the case. The documentation is very clear about the heap settings and that they cannot be used with spark.executor.extraJavaOptions: spark.executor.extraJavaOptions (default: none) - A string of extra JVM options to pass to executors. For instance, GC settings or other logging.

Re: is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-07-02 Thread Davies Liu
On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl a...@whisperstream.com wrote: In pyspark, when I convert from RDDs to dataframes it looks like the RDD is being materialized/collected/repartitioned before it's converted to a dataframe. That's not true. When converting an RDD to a dataframe, it only takes

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Mulugeta Mammo
Tried that one and it throws an error - extraJavaOptions is not allowed to alter memory settings, use spark.executor.memory instead. On Thu, Jul 2, 2015 at 12:21 PM, Benjamin Fradet benjamin.fra...@gmail.com wrote: Hi, You can set those parameters through spark.executor.extraJavaOptions

Re: Grouping runs of elements in a RDD

2015-07-02 Thread RJ Nowling
Thanks, Mohit. It sounds like we're on the same page -- I used a similar approach. On Thu, Jul 2, 2015 at 12:27 PM, Mohit Jaggi mohitja...@gmail.com wrote: if you are joining successive lines together based on a predicate, then you are doing a flatMap not an aggregate. you are on the right

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim, Spark 1.4.0's memory consumption on PermGen is higher than Spark 1.3's (explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you add --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m to the command you used to launch the Spark shell? This will increase the PermGen size

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
You should use: spark.executor.memory, from the docs https://spark.apache.org/docs/latest/configuration.html: spark.executor.memory (default: 512m) - Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g). -Todd On Thu, Jul 2, 2015 at 3:36 PM, Mulugeta Mammo
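As a concrete sketch, the setting can go into spark-defaults.conf or be passed on the command line (the application class and jar here are placeholders):

    bin/spark-submit \
      --conf spark.executor.memory=2g \
      --class com.example.MyApp myapp.jar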

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Mulugeta Mammo
Thanks, but my use case requires that I specify different start and max heap sizes. It looks like Spark sets the start and max sizes to the same value. On Thu, Jul 2, 2015 at 1:08 PM, Todd Nist tsind...@gmail.com wrote: You should use spark.executor.memory, from the docs

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Mulugeta Mammo
Yeah, I think it's a limitation too. I looked at the source code; in SparkConf.scala and ExecutorRunnable.scala both Xms and Xmx are set to the same value, which is spark.executor.memory. Thanks On Thu, Jul 2, 2015 at 1:18 PM, Todd Nist tsind...@gmail.com wrote: Yes, that does appear to be the case. The

Re: sliding

2015-07-02 Thread Feynman Liang
How about: events.sliding(3).zipWithIndex.filter(_._2 % 3 == 0) That would group the RDD into adjacent buckets of size 3. On Thu, Jul 2, 2015 at 2:33 PM, tog guillaume.all...@gmail.com wrote: Was complaining about the Seq ... Moved it to val eventsfiltered = events.sliding(3).map(s =

where is the source code for org.apache.spark.launcher.Main?

2015-07-02 Thread Shiyao Ma
Hi, It seems to me Spark launches a process to read spark-defaults.conf and then launches another process to do the app stuff. The code here should confirm it: https://github.com/apache/spark/blob/master/bin/spark-class#L76 $RUNNER -cp $LAUNCH_CLASSPATH org.apache.spark.launcher.Main $@ But

Re: sliding

2015-07-02 Thread tog
It was complaining about the Seq... Moved it to val eventsfiltered = events.sliding(3).map(s => Event(s(0).time, (s(0).x+s(1).x+s(2).x)/3.0, (s(0).vztot+s(1).vztot+s(2).vztot)/3.0)) and that is working. Anyway this is not what I wanted to do; my goal was more to implement bucketing to shorten the

Re: where is the source code for org.apache.spark.launcher.Main?

2015-07-02 Thread Shiyao Ma
After clicking through the GitHub Spark repo, it is clearly here: https://github.com/apache/spark/tree/master/launcher/src/main/java/org/apache/spark/launcher My IntelliJ project sidebar was fully expanded and I was lost in another folder. Problem solved.

configuring max sum of cores and memory in cluster through command line

2015-07-02 Thread Alexander Waldin
Hi, I'd like to specify the total sum of cores/memory as command-line arguments with spark-submit. That is, I'd like to set the yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores parameters as described in this blog
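Not an answer for the yarn.nodemanager.* properties themselves (those are YARN NodeManager settings rather than per-application ones), but for reference, the aggregate resources of a YARN application are usually steered with these spark-submit flags; the numbers and the class/jar names below are placeholders:

    bin/spark-submit \
      --master yarn-cluster \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 4g \
      --class com.example.MyApp myapp.jar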

Re: sliding

2015-07-02 Thread tog
Well, it did reduce the length of my series of events. I will have to dig into what it actually did ;-) I would assume that it took one out of every 3 values, is that correct? Would it be possible to control a bit more how the value assigned to the bucket is computed, for example take the first element, the

Re: sliding

2015-07-02 Thread Feynman Liang
Consider an example dataset [a, b, c, d, e, f] After sliding(3), you get [(a,b,c), (b,c,d), (c, d, e), (d, e, f)] After zipWithIndex: [((a,b,c), 0), ((b,c,d), 1), ((c, d, e), 2), ((d, e, f), 3)] After filter: [((a,b,c), 0), ((d, e, f), 3)], which is what I'm assuming you want (non-overlapping
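Putting the pieces together for the bucketing use case, a sketch that assumes the Event case class from earlier in the thread and MLlib's sliding on RDDs (org.apache.spark.mllib.rdd.RDDFunctions):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // Non-overlapping windows of 3, each reduced to a single averaged Event.
    val bucketed = events.sliding(3)
      .zipWithIndex
      .filter { case (_, i) => i % 3 == 0 }
      .map { case (w, _) =>
        Event(w(0).time, w.map(_.x).sum / 3.0, w.map(_.vztot).sum / 3.0)
      }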

Re: sliding

2015-07-02 Thread tog
Understood. Thanks for your great help Cheers Guillaume On 2 July 2015 at 23:23, Feynman Liang fli...@databricks.com wrote: Consider an example dataset [a, b, c, d, e, f] After sliding(3), you get [(a,b,c), (b,c,d), (c, d, e), (d, e, f)] After zipWithIndex: [((a,b,c), 0), ((b,c,d), 1),