Re: Problem reading in LZO compressed files

2014-07-13 Thread Nicholas Chammas
think something is not linked to something properly (not a Java expert unfortunately). Thanks! Ognen On 7/13/14, 10:35 AM, Nicholas Chammas wrote: If you’re still seeing gibberish, it’s because Spark is not using the LZO libraries properly. In your case, I believe you should be calling

Re: Possible bug in ClientBase.scala?

2014-07-13 Thread Nicholas Chammas
On Sun, Jul 13, 2014 at 9:49 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: I can easily fix this by changing this to YarnConfiguration instead of MRJobConfig but was wondering what the steps are for submitting a fix. Relevant links: -

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
For example, are LIKE 'string%' queries supported? Trying one on 1.0.1 yields java.lang.ExceptionInInitializerError. Nick ​ On Sat, Jul 12, 2014 at 10:16 PM, Nick Chammas nicholas.cham...@gmail.com wrote: Is there a place where we can find an up-to-date list of supported SQL syntax in Spark

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
Actually, this looks like it's some kind of regression in 1.0.1, perhaps related to assembly and packaging with spark-ec2. I don’t see this issue with the same data on a 1.0.0 EC2 cluster. How can I trace this down for a bug report? Nick On Sun, Jul 13, 2014 at 11:18 PM, Nicholas Chammas

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
Are you sure the code running on the cluster has been updated? I launched the cluster using spark-ec2 from the 1.0.1 release, so I’m assuming that’s taken care of, at least in theory. I just spun down the clusters I had up, but I will revisit this tomorrow and provide the information you

Re: Stopping StreamingContext does not kill receiver

2014-07-12 Thread Nicholas Chammas
, Jul 12, 2014 at 3:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: To add a potentially relevant piece of information, around when I stop the StreamingContext, I get the following warning: 14/07/12 22:16:18 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0

Re: Restarting a Streaming Context

2014-07-10 Thread Nicholas Chammas
Okie doke. Thanks for the confirmation, Burak and Tathagata. On Thu, Jul 10, 2014 at 2:23 AM, Tathagata Das tathagata.das1...@gmail.com wrote: I confirm that is indeed the case. It is designed to be so because it keeps things simpler - less chances of issues related to cleanup when stop()

Re: Difference between SparkSQL and shark

2014-07-10 Thread Nicholas Chammas
In short, Spark SQL is the future, built from the ground up. Shark was built as a drop-in replacement for Hive, will be retired, and will perhaps be replaced by a future initiative to run Hive on Spark https://issues.apache.org/jira/browse/HIVE-7292. More info: -

Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
Awww ye. That worked! Thank you Sameer. Is this documented somewhere? I feel there's a slight doc deficiency here. Nick On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote: Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars.

Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
changes in its API for us. Nick On Wed, Jul 9, 2014 at 3:34 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Awww ye. That worked! Thank you Sameer. Is this documented somewhere? I feel there's a slight doc deficiency here. Nick On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak

Re: reading compress lzo files

2014-07-07 Thread Nicholas Chammas
I found it quite painful to figure out all the steps required and have filed SPARK-2394 https://issues.apache.org/jira/browse/SPARK-2394 to track improving this. Perhaps I have been going about it the wrong way, but it seems way more painful than it should be to set up a Spark cluster built using

Re: reading compress lzo files

2014-07-06 Thread Nicholas Chammas
, Nicholas Chammas wrote: On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: csv = sc.newAPIHadoopFile(opts.input,com.hadoop.mapreduce.LzoTextInputFormat,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text).count

Re: Addind and subtracting workers on Spark EC2 cluster

2014-07-06 Thread Nicholas Chammas
On Sun, Jul 6, 2014 at 10:10 AM, Robert James srobertja...@gmail.com wrote: If I've created a Spark EC2 cluster, how can I add or take away workers? There is a manual process by which this is possible, but I’m not sure of the procedure. There is also SPARK-2008

Re: reading compress lzo files

2014-07-05 Thread Nicholas Chammas
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: csv = sc.newAPIHadoopFile(opts.input,com.hadoop.mapreduce.LzoTextInputFormat,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text).count() Does anyone know what the rough equivalent of this would be in
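For reference, a rough PySpark counterpart of that Scala call is sketched below. It assumes a PySpark release new enough to expose newAPIHadoopFile (it arrived after the 1.0 line discussed here) and a cluster where the hadoop-lzo jar and native LZO libraries are already installed; the path is a placeholder.

    from pyspark import SparkContext

    sc = SparkContext(appName="lzo-count")

    # Read LZO-compressed text through the Hadoop "new API" input format.
    # The format/key/value class names mirror the Scala snippet quoted above.
    csv = sc.newAPIHadoopFile(
        "s3n://your-bucket/path/*.lzo",                     # placeholder path
        "com.hadoop.mapreduce.LzoTextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text")

    print(csv.count())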

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Nicholas Chammas
On Thu, Jun 26, 2014 at 2:26 PM, Michael Bach Bui free...@adatao.com wrote: The overhead of bringing up a AWS Spark spot instances is NOT the inherent problem of Spark. That’s technically true, but I’d be surprised if there wasn’t a lot of room for improvement in spark-ec2 regarding cluster

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Nicholas Chammas
not sure. ​ On Thu, Jun 26, 2014 at 4:48 PM, Aureliano Buendia buendia...@gmail.com wrote: On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That’s technically true, but I’d be surprised if there wasn’t a lot of room for improvement in spark-ec2 regarding

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Nicholas Chammas
of 2171 S3 files, with an average size of about 18MB. On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: What do you get for rdd1._jrdd.splits().size()? You might think you’re getting 100 partitions, but it may not be happening. ​ On Tue, Jun 24, 2014 at 3
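For readers landing here, the check mentioned in this thread looks like the sketch below, using the 1.0-era PySpark trick of reaching through to the wrapped Java RDD (later releases expose getNumPartitions() directly). It assumes an existing SparkContext named sc and a placeholder path.

    rdd1 = sc.textFile("s3n://bucket/input/*", 100)   # ask for ~100 splits

    # In early PySpark the partition count is read off the underlying Java RDD.
    print(rdd1._jrdd.splits().size())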

Re: increasing concurrency of saveAsNewAPIHadoopFile?

2014-06-19 Thread Nicholas Chammas
The main thing that will affect the concurrency of any saveAs...() operations is a) the number of partitions of your RDD, and b) how many cores your cluster has. How big is the RDD in question? How many partitions does it have? ​ On Thu, Jun 19, 2014 at 3:38 PM, Sandeep Parikh
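To make that concrete, one common pattern is to repartition before the save so the write can use the whole cluster; a sketch with made-up numbers and paths, assuming an existing SparkContext sc:

    rdd = sc.textFile("hdfs:///data/input")

    # saveAs...() launches one write task per partition, so with, say, a
    # 16-core cluster, spreading the data over ~32-48 partitions keeps
    # every core busy during the write.
    rdd.repartition(48).saveAsTextFile("hdfs:///data/output")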

Re: Wildcard support in input path

2014-06-18 Thread Nicholas Chammas
Is that month= syntax something special, or do your files actually have that string as part of their name? ​ On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi all, Thanks for the reply. I'm using parquetFile as input, is that a problem? In hadoop fs -ls, the

Re: Wildcard support in input path

2014-06-18 Thread Nicholas Chammas
the partitions. It's part of the url of my files. Jianshi On Wed, Jun 18, 2014 at 11:52 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Is that month= syntax something special, or do your files actually have that string as part of their name? ​ On Wed, Jun 18, 2014 at 2:25 AM

Re: Spark is now available via Homebrew

2014-06-18 Thread Nicholas Chammas
Agreed, it would be better if Apache controlled or managed this directly. I think making such a change is just a matter of opening a new issue https://github.com/Homebrew/homebrew/issues/new on the Homebrew issue tracker. I believe that's how Spark made it in there in the first place--it was just

Re: Spark is now available via Homebrew

2014-06-18 Thread Nicholas Chammas
Matei, You might want to comment on that issue Sherl linked to, or perhaps this one https://github.com/Homebrew/homebrew/issues/30228, to ask about how Apache can manage this going forward. I know that mikemcquaid https://github.com/mikemcquaid is very active on the Homebrew repo and is one of

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nicholas Chammas
Ah, this looks like exactly what I need! It looks like this was recently added into PySpark https://github.com/apache/spark/pull/705/files#diff-6 (and Spark Core), but it's not in the 1.0.0 release. Thank you. Nick On Wed, Jun 18, 2014 at 7:42 PM, Doris Xin doris.s@gmail.com wrote: Hi

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nicholas Chammas
That’s pretty neat! So I guess if you start with an RDD of objects, you’d first do something like RDD.map(lambda x: Record(x['field_1'], x['field_2'], ...)) in order to register it as a table, and from there run your aggregates. Very nice. ​ On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nicholas Chammas
: If your input data is JSON, you can also try out the recently merged in initial JSON support: https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916 On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That’s pretty neat! So I guess

Re: list of persisted rdds

2014-06-13 Thread Nicholas Chammas
This appears to be missing from PySpark. Reported in SPARK-2141 https://issues.apache.org/jira/browse/SPARK-2141. On Fri, Jun 13, 2014 at 10:43 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: val myRdds = sc.getPersistentRDDs assert(myRdds.size === 1) It'll return a map. Its

Re: list of persisted rdds

2014-06-13 Thread Nicholas Chammas
Yeah, unfortunately PySpark still lags behind the Scala API a bit, but it's being patched up at a good pace. On Fri, Jun 13, 2014 at 1:43 PM, mrm ma...@skimlinks.com wrote: Hi Nick, Thank you for the reply, I forgot to mention I was using pyspark in my first message. Maria -- View

Re: process local vs node local subtlety question/issue

2014-06-13 Thread Nicholas Chammas
On Fri, Jun 13, 2014 at 1:55 PM, Albert Chu ch...@llnl.gov wrote: 1) How is this data process-local? I *just* copied it into HDFS. No spark worker or executor should have loaded it. Yeah, I thought that PROCESS_LOCAL meant the data was already in the JVM on the worker node, but I do see the

Re: creating new ami image for spark ec2 commands

2014-06-12 Thread Nicholas Chammas
Yeah, we badly need new AMIs that include at a minimum package/security updates and Python 2.7. There is an open issue to track the 2.7 AMI update https://issues.apache.org/jira/browse/SPARK-922, at least. On Thu, Jun 12, 2014 at 3:34 PM, unorthodox.engine...@gmail.com wrote: Creating AMIs

Re: Question about RDD cache, unpersist, materialization

2014-06-12 Thread Nicholas Chammas
FYI: Here is a related discussion http://apache-spark-user-list.1001560.n3.nabble.com/Persist-and-unpersist-td6437.html about this. On Thu, Jun 12, 2014 at 8:10 PM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: Maybe It would be nice that unpersist() ‘triggers’ the computations

Re: Using Spark to crack passwords

2014-06-11 Thread Nicholas Chammas
Yes, I mean the RDD would just have elements to define partitions or ranges within the search space, not have actual hashes. It's really just using the RDD as a control structure, rather than a real data set. As you noted, we don't need to store any hashes. We just need to check them as they

Re: How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Nicholas Chammas
In PySpark you can also do help(my_rdd) and get a nice help page of methods available. On Sunday, June 8, 2014, Carter gyz...@hotmail.com wrote: Thank you very much Gerard. -- View this message in context:

Re: Why Scala?

2014-06-06 Thread Nicholas Chammas
To add another note on the benefits of using Scala to build Spark, here is a very interesting and well-written post http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html on the Databricks blog about how Scala 2.10's runtime reflection enables

Re: SocketException when reading from S3 (s3n format)

2014-06-04 Thread Nicholas Chammas
I think by default a thread can die up to 4 times before Spark considers it a failure. Are you seeing that happen? I believe that is a configurable thing, but don't know off the top of my head how to change it. I've seen this error before when reading data from a large amount of files on S3, and

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-04 Thread Nicholas Chammas
On Wed, Jun 4, 2014 at 9:35 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Oh, I went back to m1.large while those issues get sorted out. Random side note, Amazon is deprecating the m1 instances in favor of m3 instances, which have SSDs and more ECUs than their m1 counterparts.

Re: Having spark-ec2 join new slaves to existing cluster

2014-06-03 Thread Nicholas Chammas
On Tue, Jun 3, 2014 at 6:52 AM, sirisha_devineni sirisha_devin...@persistent.co.in wrote: Did you open a JIRA ticket for this feature to be implemented in spark-ec2? If so can you please point me to the ticket? Just created it: https://issues.apache.org/jira/browse/SPARK-2008 Nick

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
with the _success folder. In any case this change of behavior should be documented IMO. Cheers Pierre Message sent from a mobile device - excuse typos and abbreviations On June 2, 2014 at 17:42, Nicholas Chammas nicholas.cham...@gmail.com wrote: What I've found using saveAsTextFile

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
OK, thanks for confirming. Is there something we can do about that leftover part- files problem in Spark, or is that for the Hadoop team? On Monday, June 2, 2014, Aaron Davidson ilike...@gmail.com wrote: Yes. On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
method if you like), but I can see it both ways. Caller beware. On Mon, Jun 2, 2014 at 10:08 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: OK, thanks for confirming. Is there something we can do about that leftover part- files problem in Spark, or is that for the Hadoop

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com wrote: (B) Semantics in Spark 1.0 and earlier: Do you mean 1.0 and later? Option (B) with the exception-on-clobber sounds fine to me, btw. My use pattern is probably common but not universal, and deleting user files is indeed

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
Could you post how exactly you are invoking spark-ec2? And are you having trouble just with r3 instances, or with any instance type? On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote: It's been another day of spinning up dead clusters... I thought I'd finally worked out what

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
If you are explicitly specifying the AMI in your invocation of spark-ec2, may I suggest simply removing any explicit mention of AMI from your invocation? spark-ec2 automatically selects an appropriate AMI based on the specified instance type. On Sunday, June 1, 2014, Nicholas

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If you are explicitly specifying the AMI in your invocation of spark-ec2, may I suggest simply removing any explicit mention of AMI from your invocation? spark-ec2 automatically selects an appropriate AMI based on the specified

Re: Spark on EC2

2014-06-01 Thread Nicholas Chammas
No, you don't have to set up your own AMI. Actually it's probably simpler and less error prone if you let spark-ec2 manage that for you as you first start to get comfortable with Spark. Just spin up a cluster without any explicit mention of AMI and it will do the right thing. On Sunday, June 1, 2014,

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Nicholas Chammas
(,) sc.textFile(fileStr)... - Patrick On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: YES, your hunches were correct. I’ve identified at least one file among the hundreds I’m processing that is indeed not a valid gzip file. Does anyone know of an easy way
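One low-tech way to find the offending file, outside of Spark: download the candidates and try to decompress each one with Python's gzip module, reporting any that fail. A sketch, assuming the files have been pulled down to a local directory:

    import glob
    import gzip

    for path in glob.glob("/tmp/downloaded/*.gz"):   # placeholder directory
        f = gzip.open(path, "rb")
        try:
            # Read the whole stream; a corrupt member raises an error here.
            while f.read(1024 * 1024):
                pass
        except (IOError, EOFError) as e:
            print("corrupt gzip file: %s (%s)" % (path, e))
        finally:
            f.close()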

Re: Announcing Spark 1.0.0

2014-05-30 Thread Nicholas Chammas
You guys were up late, eh? :) I'm looking forward to using this latest version. Is there any place we can get a list of the new functions in the Python API? The release notes don't enumerate them. Nick On Fri, May 30, 2014 at 10:15 AM, Ian Ferreira ianferre...@hotmail.com wrote: Congrats

Re: Persist and unpersist

2014-05-27 Thread Nicholas Chammas
Daniel, Is SPARK-1103 https://issues.apache.org/jira/browse/SPARK-1103 related to your example? Automatic unpersist()-ing of unreferenced RDDs would be nice. Nick ​ On Tue, May 27, 2014 at 12:28 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: I keep bumping into a problem with

Re: Using Spark to analyze complex JSON

2014-05-23 Thread Nicholas Chammas
in error, please inform the sender immediately. If you are not the intended recipient you must not use, disclose, copy, print, distribute or rely on this email. On 22 May 2014 04:43, Nicholas Chammas nicholas.cham...@gmail.com wrote: That's a good idea. So you're saying create a SchemaRDD

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Nicholas Chammas
Looking forward to that update! Given a table of JSON objects like this one: { "name": "Nick", "location": { "x": 241.6, "y": -22.5 }, "likes": ["ice cream", "dogs", "Vanilla Ice"] } It would be SUPER COOL if we could query that table in a way that is as natural as follows: SELECT
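For readers following along later: this style of query did become possible with the JSON support that landed in Spark SQL after this thread. A hedged PySpark sketch in the post-1.0 API (jsonRDD and registerTempTable are the names used in those releases; sc is an existing SparkContext):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    records = sc.parallelize([
        '{"name": "Nick", "location": {"x": 241.6, "y": -22.5}, '
        '"likes": ["ice cream", "dogs", "Vanilla Ice"]}'
    ])

    # Infer the schema from the JSON strings, then query nested fields
    # with the natural dotted syntax.
    people = sqlContext.jsonRDD(records)
    people.registerTempTable("people")
    print(sqlContext.sql(
        "SELECT name, location.x, location.y FROM people").collect())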

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Nicholas Chammas
, May 22, 2014 at 5:32 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Looking forward to that update! Given a table of JSON objects like this one: { "name": "Nick", "location": { "x": 241.6, "y": -22.5 }, "likes": ["ice cream", "dogs", "Vanilla Ice

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Nicholas Chammas
Thanks for the suggestions, people. I will try to hone in on which specific gzipped files, if any, are actually corrupt. Michael, I’m using Hadoop 1.0.4, which I believe is the default version that gets deployed by spark-ec2. The JIRA issue I linked to earlier,

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Any tips on how to troubleshoot this? On Thu, May 15, 2014 at 4:15 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I’m trying to do a simple count() on a large number of GZipped files in S3. My job is failing with the following message: 14/05/15 19:12:37 WARN scheduler.TaskSetManager:

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Yes, it does work with fewer GZipped files. I am reading the files in using sc.textFile() and a pattern string. For example: a = sc.textFile('s3n://bucket/2014-??-??/*.gz') a.count() Nick ​ On Tue, May 20, 2014 at 10:09 PM, Madhu ma...@madhu.com wrote: I have read gzip files from S3

Re: Using mongo with PySpark

2014-05-17 Thread Nicholas Chammas
Where's your driver code (the code interacting with the RDDs)? Are you getting serialization errors? On Saturday, May 17, 2014, Samarth Mailinglist mailinglistsama...@gmail.com wrote: Hi all, I am trying to store the results of a reduce into mongo. I want to share the variable collection in the

Re: Worker re-spawn and dynamic node joining

2014-05-17 Thread Nicholas Chammas
Thanks for the info about adding/removing nodes dynamically. That's valuable. On Friday, May 16, 2014, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Han :) 1. Is there a way to automatically re-spawn spark workers? We've situations where executor OOM causes worker process to be DEAD and it does

Re: How to read a multipart s3 file?

2014-05-16 Thread Nicholas Chammas
On Wed, May 7, 2014 at 4:44 PM, Aaron Davidson ilike...@gmail.com wrote: Spark can only run as many tasks as there are partitions, so if you don't have enough partitions, your cluster will be underutilized. This is a very important point. kamatsuoka, how many partitions does your RDD have

Re: java.lang.StackOverflowError when calling count()

2014-05-14 Thread Nicholas Chammas
Would cache() + count() every N iterations work just as well as checkpoint() + count() to get around this issue? We're basically trying to get Spark to avoid working on too lengthy a lineage at once, right? Nick On Tue, May 13, 2014 at 12:04 PM, Xiangrui Meng men...@gmail.com wrote: After
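A small sketch of the pattern under discussion, for an iterative job; the checkpoint directory, iteration count, and per-iteration transformation are placeholders, and sc is an existing SparkContext:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # required for checkpoint()

    rdd = sc.parallelize(range(1000))

    for i in range(100):
        rdd = rdd.map(lambda x: x + 1)               # placeholder work
        if i % 10 == 0:
            rdd.cache()
            rdd.checkpoint()   # cut the lineage so it never gets too deep
            rdd.count()        # action that forces materialization now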

Re: How to read a multipart s3 file?

2014-05-12 Thread Nicholas Chammas
On Wed, May 7, 2014 at 4:00 AM, Han JU ju.han.fe...@gmail.com wrote: But in my experience, when reading directly from s3n, spark create only 1 input partition per file, regardless of the file size. This may lead to some performance problem if you have big files. You can (and perhaps should)

Re: logging in pyspark

2014-05-12 Thread Nicholas Chammas
, Nicholas Chammas nicholas.cham...@gmail.com wrote: I think you're looking for RDD.foreach() http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach . According to the programming guide http://spark.apache.org/docs/latest/scala-programming-guide.html : Run

Re: How to read a multipart s3 file?

2014-05-11 Thread Nicholas Chammas
On Tue, May 6, 2014 at 10:07 PM, kamatsuoka ken...@gmail.com wrote: I was using s3n:// but I got frustrated by how slow it is at writing files. I'm curious: How slow is slow? How long does it take you, for example, to save a 1GB file to S3 using s3n vs s3?

Re: How to read a multipart s3 file?

2014-05-07 Thread Nicholas Chammas
Amazon also strongly discourages the use of s3:// because the block file system it maps to is deprecated. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html Note The configuration of Hadoop running on Amazon EMR differs from the default configuration

Re: logging in pyspark

2014-05-06 Thread Nicholas Chammas
I think you're looking for RDD.foreach() http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach . According to the programming guide http://spark.apache.org/docs/latest/scala-programming-guide.html : Run a function func on each element of the dataset. This is usually
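A tiny illustration of RDD.foreach(), assuming an existing SparkContext sc; note that the function runs on the workers, so its output ends up in the executor logs rather than on the driver console:

    def log_element(x):
        # Executed on the workers; prints go to each executor's stdout log.
        print("processing: %s" % x)

    sc.parallelize(range(10)).foreach(log_element)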

Re: cache not work as expected for iteration?

2014-05-04 Thread Nicholas Chammas
Yes, persist/cache will cache an RDD only when an action is applied to it. On Sun, May 4, 2014 at 6:32 AM, Earthson earthson...@gmail.com wrote: thx for the help, unpersist is exactly what I want :) I see that spark will remove some cache automatically when memory is full, it is much more

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-04 Thread Nicholas Chammas
is an extension of the familiar hadoop distcp, of course. On Thu, May 1, 2014 at 11:41 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: The fastest way to save to S3 should be to leave the RDD with many partitions, because all partitions will be written out in parallel. Then, once

Spark Training

2014-05-01 Thread Nicholas Chammas
There are many freely-available resources for the enterprising individual to use if they want to Spark up their life. For others, some structured training is in order. Say I want everyone from my department at my company to get something like the AMP Camp http://ampcamp.berkeley.edu/ experience,

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-01 Thread Nicholas Chammas
12:15 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Yes, saveAsTextFile() will give you 1 part per RDD partition. When you coalesce(1), you move everything in the RDD to a single partition, which then gives you 1 output file. It will still be called part-0 or something like

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Nicholas Chammas
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you coalesce(1), you move everything in the RDD to a single partition, which then gives you 1 output file. It will still be called part-0 or something like that because that’s defined by the Hadoop API that Spark uses for
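A minimal sketch of that single-output-file pattern, with placeholder paths and an existing SparkContext sc:

    rdd = sc.textFile("s3n://bucket/input/*")   # read with many partitions

    # Collapsing to one partition makes saveAsTextFile emit a single part- file,
    # but it also funnels all the data through one task, so only do this when
    # the output is small enough for a single worker.
    rdd.coalesce(1).saveAsTextFile("s3n://bucket/output")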

Re: processing s3n:// files in parallel

2014-04-28 Thread Nicholas Chammas
It would be useful to have some way to open multiple files at once into a single RDD (e.g. sc.textFile(iterable_over_uris)). Logically, it would be equivalent to opening a single file which is made by concatenating the various files together. This would only be useful, of course, if the source

Re: File list read into single RDD

2014-04-28 Thread Nicholas Chammas
not obvious. Nick On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote: Perfect. BTW just so I know where to look next time, was that in some docs? On Apr 28, 2014, at 7:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote

Re: Spark and HBase

2014-04-26 Thread Nicholas Chammas
to more traditional methods though. The only worry I have is that the Phoenix input format doesn't adequately split the data across multiple nodes, so that's something I will need to look at further. Josh On Apr 25, 2014, at 6:33 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Josh

Re: Spark and HBase

2014-04-25 Thread Nicholas Chammas
On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Just took a quick look at the overview here http://phoenix.incubator.apache.org/ and the quick start guide here http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html . It looks like Apache

Re: Spark is slow

2014-04-22 Thread Nicholas Chammas
How long are the count() steps taking? And how many partitions are pairs1 and triples initially divided into? You can see this by doing pairs1._jrdd.splits().size(), for example. If you just need to count the number of distinct keys, is it faster if you did the following instead of
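The suggestion is cut off above, but it is presumably along these lines: count the distinct keys directly instead of grouping first. A sketch with made-up data and an existing SparkContext sc:

    pairs1 = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])   # placeholder pairs

    # Counting distinct keys avoids shuffling every value into groups.
    num_keys = pairs1.map(lambda kv: kv[0]).distinct().count()
    print(num_keys)   # 2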

Re: Spark is slow

2014-04-21 Thread Nicholas Chammas
I'm seeing the same thing as Marcelo, Joe. All your mail is going to my Spam folder. :( With regards to your questions, I would suggest in general adding some more technical detail to them. It will be difficult for people to give you suggestions if all they are told is "Spark is slow". How does

Spark Streaming source from Amazon Kinesis

2014-04-21 Thread Nicholas Chammas
I'm looking to start experimenting with Spark Streaming, and I'd like to use Amazon Kinesis https://aws.amazon.com/kinesis/ as my data source. Looking at the list of supported Spark Streaming sources http://spark.apache.org/docs/latest/streaming-programming-guide.html#linking, I don't see any

Re: GC overhead limit exceeded

2014-04-16 Thread Nicholas Chammas
Never mind. I'll take it from both Andrew and Syed's comments that the answer is yes. Dunno why I thought otherwise. On Wed, Apr 16, 2014 at 5:43 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I’m running into a similar issue as the OP. I’m running the same job over and over

Re: choose the number of partition according to the number of nodes

2014-04-16 Thread Nicholas Chammas
From the Spark tuning guide http://spark.apache.org/docs/latest/tuning.html : In general, we recommend 2-3 tasks per CPU core in your cluster. I think you can only get one task per partition to run concurrently for a given RDD. So if your RDD has 10 partitions, then 10 tasks at most can operate
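Read concretely, the guideline works out like this (numbers and paths are made up; sc is an existing SparkContext):

    total_cores = 8                      # cores across the whole cluster (assumed)
    rdd = sc.textFile("hdfs:///data/events")

    # Aim for 2-3 tasks per core so no core sits idle while others finish.
    rdd = rdd.repartition(total_cores * 3)
    print(rdd._jrdd.splits().size())     # 1.0-era way to confirm the count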

Re: partitioning of small data sets

2014-04-15 Thread Nicholas Chammas
Looking at the Python version of textFile() http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile, shouldn't it be *max*(self.defaultParallelism, 2)? If the default parallelism is, say 4, wouldn't we want to use that for minSplits instead of 2? On Tue,

Re: Setting properties in core-site.xml for Spark and Hadoop to access

2014-04-11 Thread Nicholas Chammas
doesn't seem to pick them up. Nick On Fri, Mar 7, 2014 at 4:10 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Mayur, So looking at the section on environment variables here http://spark.incubator.apache.org/docs/latest/configuration.html#environment-variables, are you saying to set

Re: programmatic way to tell Spark version

2014-04-10 Thread Nicholas Chammas
, Nicholas Chammas nicholas.cham...@gmail.com wrote: Hey Patrick, I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to track this request, in case the team/community wants to implement it in the future. Nick On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
Marco, If you call spark-ec2 launch without specifying an AMI, it will default to the Spark-provided AMI. Nick On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini silvio.costant...@granatads.com wrote: Hi there, To answer your question; no there is no reason NOT to use an AMI that Spark has

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
And for the record, that AMI is ami-35b1885c. Again, you don't need to specify it explicitly; spark-ec2 will default to it. On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Marco, If you call spark-ec2 launch without specifying an AMI, it will default

Re: programmatic way to tell Spark version

2014-04-09 Thread Nicholas Chammas
Hey Patrick, I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to track this request, in case the team/community wants to implement it in the future. Nick On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: No use case at the moment

Re: Spark 0.9.1 released

2014-04-09 Thread Nicholas Chammas
A very nice addition for us PySpark users in 0.9.1 is the addition of RDD.repartition(), which is not mentioned in the release notes http://spark.apache.org/releases/spark-release-0-9-1.html ! This is super helpful for when you create an RDD from a gzipped file and then need to explicitly shuffle

Re: Spark 0.9.1 released

2014-04-09 Thread Nicholas Chammas
at 3:58 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: A very nice addition for us PySpark users in 0.9.1 is the addition of RDD.repartition(), which is not mentioned in the release notes http://spark.apache.org/releases/spark-release-0-9-1.html ! This is super helpful for when you

Re: Spark and HBase

2014-04-08 Thread Nicholas Chammas
Just took a quick look at the overview here http://phoenix.incubator.apache.org/ and the quick start guide here http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html . It looks like Apache Phoenix aims to provide flexible SQL access to data, both for transactional and analytic

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nicholas Chammas
, logically, but, that's not to say that the machine it's on shouldn't do work. -- Sean Owen | Director, Data Science | London On Tue, Apr 8, 2014 at 8:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: So I have a cluster in EC2 doing some work, and when I take a look here http

Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi all, I have put this line

Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
. Setting it to a higher number won't do anything helpful. On Fri, Apr 4, 2014 at 2:47 PM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: What do you advice me Nicholas? Em 4/4/14, 19:05, Nicholas Chammas escreveu: If you're running on one machine with 2 cores, I believe all you can

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
) .partitionBy(numPartitions) .map(lambda (counter, data): data)) If there's supposed to be a built-in Spark method to do this, I'd love to learn more about it. Nick On Tue, Apr 1, 2014 at 7:59 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Hmm, doing help(rdd

Re: Spark output compression on HDFS

2014-04-02 Thread Nicholas Chammas
Is this a Scala-only http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile feature? On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote: For textFile I believe we overload it and let you set a codec directly:

Re: Spark output compression on HDFS

2014-04-02 Thread Nicholas Chammas
, Apr 2, 2014 at 3:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Is this a Scala-only http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile feature? On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote: For textFile I

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
/pyspark/rdd.py#L1128 On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Update: I'm now using this ghetto function to partition the RDD I get back when I call textFile() on a gzipped file: # Python 2.6 def partitionRDD(rdd, numPartitions): counter
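For completeness, here is a hedged reconstruction of the helper sketched in this thread, based only on the fragments visible above: key each element with a running counter, hash-partition on that key, then drop it again. The exact details of the original may differ; this version numbers elements per input partition (which matches the single-partition gzipped-file case) and avoids the Python-2-only tuple-unpacking lambda.

    def partitionRDD(rdd, numPartitions):
        """Spread an RDD (e.g. one read from a single gzipped file) across
        numPartitions by keying, hash-partitioning, and un-keying."""
        def number_elements(iterator):
            return enumerate(iterator)          # yields (counter, data) pairs
        return (rdd
                .mapPartitions(number_elements)
                .partitionBy(numPartitions)
                .map(lambda pair: pair[1]))     # strip the counter again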

Re: Efficient way to aggregate event data at daily/weekly/monthly level

2014-04-02 Thread Nicholas Chammas
Watch out with loading data from gzipped files. Spark cannot parallelize the load of gzipped files, and if you do not explicitly repartition your RDD created from such a file, everything you do on that RDD will run on a single core. On Wed, Apr 2, 2014 at 8:22 PM, K Koh den...@gmail.com wrote:
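A short sketch of the repartition step that advice calls for, with a placeholder path and placeholder parsing, assuming an existing SparkContext sc:

    # A gzipped file is not splittable, so this RDD starts with one partition.
    events = sc.textFile("hdfs:///data/events.gz")

    # Spread the data across the cluster before doing any heavy work on it.
    events = events.repartition(16)

    # Placeholder aggregation: count events per leading date string.
    daily = events.map(lambda line: (line[:10], 1)).reduceByKey(lambda a, b: a + b)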

Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
to something splittable. Otherwise, if I want to speed up subsequent computation on the RDD, I should explicitly partition it with a call to RDD.partitionBy(10). Is that correct? On Mon, Mar 31, 2014 at 1:15 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: OK sweet. Thanks for walking

PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Just an FYI, it's not obvious from the docs http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy that the following code should fail: a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2) a._jrdd.splits().size() a.count() b = a.partitionBy(5)
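For contrast, the call goes through once the elements are key-value tuples; a small sketch against the same toy data, assuming an existing SparkContext sc:

    a = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

    # partitionBy() hashes the KEY of each element, so wrap plain values into
    # (key, value) pairs first and unwrap them again afterwards.
    b = a.map(lambda x: (x, None)).partitionBy(5).map(lambda kv: kv[0])
    print(b._jrdd.splits().size())   # 5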

Re: Cannot Access Web UI

2014-04-01 Thread Nicholas Chammas
Are you trying to access the UI from another machine? If so, first confirm that you don't have a network issue by opening the UI from the master node itself. For example: yum -y install lynx lynx ip_address:8080 If this succeeds, then you likely have something blocking you from accessing the

Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Nicholas Chammas
...@mail.gmail.com%3E On Tue, Apr 1, 2014 at 1:51 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Alright, so I've upped the minSplits parameter on my call to textFile, but the resulting RDD still has only 1 partition, which I assume means it was read in on a single process. I am

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
in the latest docs. Sorry about that, I also didn't realize partitionBy() had this behavior from reading the Python docs (though it is consistent with the Scala API, just more type-safe there). On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Just an FYI

Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
Howdy-doody, I have a single, very large file sitting in S3 that I want to read in with sc.textFile(). What are the best practices for reading in this file as quickly as possible? How do I parallelize the read as much as possible? Similarly, say I have a single, very large RDD sitting in memory

Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
31, 2014 at 9:46 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: So setting minSplits http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.context.SparkContext-class.html#textFile will set the parallelism on the read in SparkContext.textFile(), assuming I have the cores
