Re: how to avoid reading the first line of dataframe?

2013-09-24 Thread Nathan Kronenfeld
> …ally have headers in the first row, how can I avoid reading the first row?
> I know in Hadoop, I can figure it out by the line number.
>
> Best

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com

Re: how to avoid reading the first line of dataframe?

2013-09-24 Thread Nathan Kronenfeld
…st partition.

> i.e.
>
> data.mapPartitionsWithIndex { case (index, iter) =>
>   if (index == 0) iter.drop(1) else iter
> }
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
> On Tue, Sep 24, 2013 at 11:10 PM, Nathan Kronenfeld <…
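
For reference, a self-contained sketch of the approach quoted above - dropping the first line of only the first partition - might look like the following. The file path is a placeholder, and the sketch assumes the input is a single file (with several input files, every file after the first keeps its own header line):

    import org.apache.spark.SparkContext

    object SkipHeaderExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "skip-header-example")

        // Placeholder path; assumes a single CSV whose first line is a header.
        val lines = sc.textFile("data/input.csv")

        // Only partition 0 can contain the file's first line, so drop one
        // element there and pass every other partition through untouched.
        val noHeader = lines.mapPartitionsWithIndex { (index, iter) =>
          if (index == 0) iter.drop(1) else iter
        }

        noHeader.take(5).foreach(println)
        sc.stop()
      }
    }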

Re: Visitor function to RDD elements

2013-10-22 Thread Nathan Kronenfeld
You shouldn't have to fly data around. You can just run it first on partition 0, then on partition 1, etc. I may have the name slightly off, but something approximately like:

for (p <- 0 until numPartitions)
  data.mapPartitionsWithIndex((i, iter) =>
    if (i == p) iter.map(fcn) else List().iterator…
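
Completing the truncated snippet above into something runnable, a sketch of the "one partition at a time" pattern could look like this; the visit function is hypothetical, and each pass is a separate job over the whole RDD, so you trade cluster parallelism for ordered, partition-at-a-time processing:

    import org.apache.spark.rdd.RDD

    object PartitionVisitor {
      // Hypothetical visitor; replace with whatever per-element work is needed.
      def visit(s: String): Unit = println(s)

      // Visit every element on the driver, one partition at a time, in order.
      def visitByPartition(data: RDD[String]): Unit = {
        val numPartitions = data.partitions.length
        for (p <- 0 until numPartitions) {
          // Keep only partition p; every other partition contributes an
          // empty iterator, so only one partition's data is collected.
          val partitionData = data.mapPartitionsWithIndex { (i, iter) =>
            if (i == p) iter else Iterator.empty
          }.collect()
          partitionData.foreach(visit)
        }
      }
    }

If the visitor doesn't need to run on the driver or in partition order, a single data.foreachPartition pass is much cheaper.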

Re: almost sorted data

2013-10-25 Thread Nathan Kronenfeld
> …d (*almost*) by timestamp.
> If I do a full sort it takes a lot of time. Is there some way to sort more
> efficiently (like restricting the sort to per partition)?
>
> Thanks in advance

Re: almost sorted data

2013-10-25 Thread Nathan Kronenfeld
…e) in the right location.

On Fri, Oct 25, 2013 at 10:17 AM, Sebastian Schelter wrote:
> Using a local sort per partition only gives a correct result if the data
> is already range partitioned.
>
> On 25.10.2013 16:11, Nathan Kronenfeld wrote:
>> Since no one else has a…
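
To make the caveat quoted above concrete: a per-partition sort only yields a globally sorted RDD when the partitions already cover disjoint, ordered ranges - roughly the "almost sorted by timestamp" situation that started the thread. Under that assumption, a minimal sketch of a local sort that avoids the shuffle of a full sortByKey might look like this (the Event type and timestamp field are illustrative):

    import org.apache.spark.rdd.RDD

    case class Event(timestamp: Long, payload: String)

    object PerPartitionSort {
      // Sorts each partition in memory by timestamp. Cheaper than a full
      // sortByKey because nothing is shuffled, but only globally correct if
      // the partitions are already range-partitioned by timestamp, and each
      // partition must fit in a worker's memory to be buffered and sorted.
      def sortWithinPartitions(events: RDD[Event]): RDD[Event] =
        events.mapPartitions(
          iter => iter.toArray.sortBy(_.timestamp).iterator,
          preservesPartitioning = true
        )
    }

If the partitions overlap in time, a full sort (or repartitioning by time range first) is still needed for a correct global order.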

Re: almost sorted data

2013-10-28 Thread Nathan Kronenfeld
> …algorithms (https://en.wikipedia.org/wiki/Adaptive_sort) can benefit
> from presortedness in their inputs, so that might be a helpful search
> term for researching this problem.
>
> On Fri, Oct 25, 2013 at 7:23 AM, Nathan Kronenfeld <…

build problem

2013-11-28 Thread Nathan Kronenfeld
…re? Any help very much appreciated,
-Nathan

Problem connecting to cluster

2013-11-28 Thread Nathan Kronenfeld
…ng - anyone have any clue? Thanks in advance,
-Nathan

Cluster not accepting jobs

2013-12-06 Thread Nathan Kronenfeld
…missing something in the setup - does anyone know what? Thanks in advance,
-Nathan Kronenfeld

Re: Cluster not accepting jobs

2013-12-06 Thread Nathan Kronenfeld
Never mind, I figured it out - apparently the machine name resolved differently locally and within the cluster; when I use the IP address instead of the machine name in MASTER, it all seems to work.

On Fri, Dec 6, 2013 at 1:38 PM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:
> …
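
For anyone hitting the same symptom, a minimal sketch of the workaround: pass the master URL with the exact IP and port the master actually advertises (the spark:// URL shown in its web UI) rather than a hostname that may resolve differently on the driver and inside the cluster. The address, port, and app name below are placeholders:

    import org.apache.spark.SparkContext

    object ConnectByIp {
      def main(args: Array[String]): Unit = {
        // Placeholder values: use the spark://IP:port the master reports,
        // not a machine name that resolves differently inside the cluster.
        val masterUrl = "spark://192.168.1.10:7077"
        val sc = new SparkContext(masterUrl, "connect-by-ip-test")

        // Trivial job just to confirm the cluster is accepting work.
        println(sc.parallelize(1 to 100).count())
        sc.stop()
      }
    }

Making the hostname resolve identically everywhere (or setting SPARK_LOCAL_IP explicitly) may be the longer-term fix.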

Fwd: Spark forum question

2013-12-11 Thread Nathan Kronenfeld
…rary/_attempt_201312111200__m_00_0/part-0) seems suspect since I'm running from a cmd shell. Running from a Cygwin shell leads to other errors. Has anyone been able to get simple file output to run from either a Cygwin shell or the Windows cmd shell? Does anyone know…

Re: Fwd: Spark forum question

2013-12-11 Thread Nathan Kronenfeld
rk-output". > > > On 12/11/2013 10:45 AM, Nathan Kronenfeld wrote: > > We are trying to test out running Spark 0.8.0 on a Windows box, and > while we can get it to run all the examples that don't output results to > disk, we can't get it to write output.. >

Re: Fwd: Spark forum question

2013-12-11 Thread Nathan Kronenfeld
Oops. Stupid mail client. Sorry about that.

When we change res.saveAsTextFile("file:///c:/some/path") to res.saveAsTextFile("path") and run it from c:\some, we get exactly the same error.

Re: Spark forum question

2013-12-11 Thread Nathan Kronenfeld
> …ng it without Cygwin on your PATH. It seems that it's trying to call
> chmod and such and failing.
>
> Matei
>
> On Dec 11, 2013, at 10:38 AM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:
>
>> Oops. Stupid mail client. Sorry about that.
>>
>> When we…

Re: IOException - Cannot run program "cygpath": ....

2013-12-11 Thread Nathan Kronenfeld
> …en(SparkHadoopWriter.scala:86)
> org.apache.spark.rdd.PairRDDFunctions.writeToFile$1(PairRDDFunctions.scala:667)
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:680)
>
> Environment
> …

Re: IOException - Cannot run program "cygpath": ....

2013-12-11 Thread Nathan Kronenfeld
…changes. If you like, I can mail you the changes (or just post them here) if that helps, or you can wait a day or two while we put together a pull request with the necessary changes.

-Nathan

On Wed, Dec 11, 2013 at 2:22 PM, Nathan Kronenfeld <nkronenf...@oculusinfo.com>…

FileNotFoundException running spark job

2013-12-17 Thread Nathan Kronenfeld
…11094

Any clue what this means? I've checked open files on the worker node while this task is running (by running lsof | wc -l every 5 seconds) and I don't even see a blip - it looks nice and steady, with no problems.

Spark streaming vs. spark usage

2013-12-17 Thread Nathan Kronenfeld
…ere a different paradigm of working with both that I'm just missing?

-Thanks, Nathan

Re: FileNotFoundException running spark job

2013-12-17 Thread Nathan Kronenfeld
Actually, I did eventually find some FileNotFound exceptions in the worker logs too - both on machines the client reported had problems, and on other machines.

Re: Spark streaming vs. spark usage

2013-12-18 Thread Nathan Kronenfeld
…tream might have any number of other functions, but if you were just using the basic ones, you would just call existingBatchStuff(rdd) or existingBatchStuff(dstream).
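
A minimal sketch of that idea - keeping the core logic as an ordinary RDD-to-RDD function and reusing it unchanged from both a batch job and a streaming job via DStream.transform - might look like the following. The word-count logic, input path, and socket source are placeholders, and the snippet targets a somewhat newer streaming API than the 0.8-era thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD implicits for older Spark
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SharedBatchStreamingLogic {
      // The shared "existing batch stuff": a plain RDD => RDD function.
      def existingBatchStuff(lines: RDD[String]): RDD[(String, Int)] =
        lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "shared-logic-example")

        // Batch usage: call the function directly on an RDD.
        existingBatchStuff(sc.textFile("data/batch-input.txt")).take(10).foreach(println)

        // Streaming usage: apply the same function to each micro-batch's RDD.
        val ssc = new StreamingContext(sc, Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
        lines.transform(rdd => existingBatchStuff(rdd)).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

DStream.foreachRDD works the same way when the shared function performs side effects (saving, updating a store) rather than returning a transformed stream.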

Re: RDD API question

2014-02-14 Thread Nathan Kronenfeld
> K2 D
> K2 D
> K2 E
>
> and I want to create
>
> A B
> A C
> B C
> D D
> D E
>
> What's the best way to do this? If I join the RDD with itself, I will end
> up with A A, which I do no…

Re: RDD API question

2014-02-14 Thread Nathan Kronenfeld
Sorry, that was a bit incomplete. Comments below:

On Fri, Feb 14, 2014 at 10:58 AM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:
> Assuming no set has so many combinations that it won't fit on a worker,
> the following works:
>
> val data = sc.parallelize(List((…
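
Nathan's actual code is cut off in the archive, so the following is not necessarily what he posted - just a minimal sketch of one way to produce those pairs, by grouping on the key and emitting the distinct unordered pairs within each group (names and sample data are illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD implicits for groupByKey

    object PairsPerKey {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "pairs-per-key")

        val data = sc.parallelize(List(
          ("K1", "A"), ("K1", "B"), ("K1", "C"),
          ("K2", "D"), ("K2", "D"), ("K2", "E")
        ))

        // For each key, collect its values and emit every distinct unordered
        // pair. List.combinations(2) already skips duplicate combinations, so
        // ("K2", List(D, D, E)) yields (D, D) and (D, E) exactly once each.
        val pairs = data.groupByKey().flatMap { case (_, values) =>
          values.toList.combinations(2).map { case List(a, b) => (a, b) }
        }

        pairs.collect().foreach { case (a, b) => println(s"$a $b") }
        sc.stop()
      }
    }

As in the quoted caveat, this assumes no single key has so many values (and hence combinations) that its group won't fit on one worker.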

Re: How to achieve this in Spark

2014-02-19 Thread Nathan Kronenfeld
> …Id) )
>
> However, there is no "contains" method, so I'm looking for the most
> efficient way of achieving this in Spark.
>
> Thanks.
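
The reply itself is truncated in the archive, so this is not necessarily what was suggested - just one common pattern for this kind of membership test when one side is small enough to hold in memory: collect it, broadcast it as a Set, and filter the large RDD against the broadcast value (all names and data below are hypothetical):

    import org.apache.spark.SparkContext

    object MembershipFilter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "membership-filter")

        // Hypothetical data: a small RDD of interesting ids, a large RDD of records.
        val interestingIds = sc.parallelize(Seq(1L, 7L, 42L))
        val records = sc.parallelize(Seq((1L, "a"), (2L, "b"), (42L, "c")))

        // Collect the small side and broadcast it so each worker gets one copy;
        // Set membership then stands in for the missing RDD "contains".
        val idSet = sc.broadcast(interestingIds.collect().toSet)
        val kept = records.filter { case (id, _) => idSet.value.contains(id) }

        kept.collect().foreach(println)
        sc.stop()
      }
    }

When neither side fits comfortably in memory, an inner join on the id is the usual alternative.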

Trying to connect to spark from within a web server

2014-02-21 Thread Nathan Kronenfeld
…en a sign of an attempted connection. I'm trying to use a JavaSparkContext, and I've printed out the parameters I pass in, and they work fine in a stand-alone program. Does anyone have a clue why this fails? Or even how to find out why it fails?

Re: Trying to connect to spark from within a web server

2014-02-21 Thread Nathan Kronenfeld
> netstat -an | grep 7077
> This will give you exactly which IP to bind to when launching the Spark
> context.
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
> On F…

Re: Trying to connect to spark from within a web server

2014-02-22 Thread Nathan Kronenfeld
> …ing workers & when you connect to that IP with the Java API the cluster
> appears to be down to it?
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
> On Fr…

Re: Disable all spark logging

2014-02-23 Thread Nathan Kronenfeld
> …s message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Disable-all-spark-logging-tp1960.html
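
Nathan's reply is also cut off here; for readers landing on this thread, one common way to silence Spark's console chatter (not necessarily the advice given in the reply) is to raise the level of the relevant log4j loggers, either programmatically as sketched below or in a log4j.properties on the classpath. This assumes the log4j 1.x used by Spark of that era; Spark 3.3+ uses log4j2 and a log4j2.properties instead:

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.SparkContext

    object QuietSpark {
      def main(args: Array[String]): Unit = {
        // Raise the log level before (or right after) creating the context.
        // "org" covers Spark's own org.apache.spark.* loggers as well as the
        // noisier Hadoop and Jetty classes it pulls in.
        Logger.getLogger("org").setLevel(Level.ERROR)
        Logger.getLogger("akka").setLevel(Level.ERROR)

        val sc = new SparkContext("local", "quiet-example")
        println(sc.parallelize(1 to 10).count())
        sc.stop()
      }
    }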