Re: createDirectStream with offsets

2016-05-07 Thread Eric Friedman
the types you're passing in don't > match. For instance, you're passing in a message handler that returns > a tuple, but the rdd return type you're specifying (the 5th type > argument) is just String. > > On Fri, May 6, 2016 at 9:49 AM, Eric Friedman >

Re: createDirectStream with offsets

2016-05-06 Thread Eric Friedman
.1' compile 'org.apache.kafka:kafka_2.10:0.8.2.1' compile 'com.yammer.metrics:metrics-core:2.2.0' On Fri, May 6, 2016 at 7:47 AM, Eric Friedman wrote: > Hello, > > I've been using createDirectStream with Kafka and now need to switch to > the version of tha

createDirectStream with offsets

2016-05-06 Thread Eric Friedman
Hello, I've been using createDirectStream with Kafka and now need to switch to the version of that API that lets me supply offsets for my topics. I'm unable to get this to compile for some reason, even if I lift the very same usage from the Spark test suite. I'm calling it like this: val to
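
A minimal sketch of the offset-taking overload in the Spark 1.x Kafka 0.8 integration, assuming an existing StreamingContext (ssc) and kafkaParams map, with hypothetical topic names and offsets; as the reply in this thread points out, the fifth type parameter must match the messageHandler's return type:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // fromOffsets maps each topic/partition to the offset to start reading from
    val fromOffsets = Map(TopicAndPartition("my_topic", 0) -> 1000L)   // hypothetical values
    // the handler's return type (here a (String, String) tuple) must equal the 5th type parameter
    val handler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, handler)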

Hadoop Context

2016-04-28 Thread Eric Friedman
Hello, Where in the Spark APIs can I get access to the Hadoop Context instance? I am trying to implement the Spark equivalent of this public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { if (record == null) { throw
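
There is no Hadoop Context handle in the Spark APIs; a sketch of the usual translation, with stand-in data, is to express the reduce body as a closure over grouped values and use TaskContext for per-task metadata:

    import org.apache.spark.TaskContext

    val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2")))   // stand-in for the real key/value data
    val checked = pairs.groupByKey().map { case (key, values) =>
      val ctx = TaskContext.get()   // per-task metadata: partition id, attempt number, etc.
      if (values.isEmpty)
        throw new IllegalStateException(s"no values for $key in partition ${ctx.partitionId}")
      (key, values.size)
    }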

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Eric Friedman
not visible to the > Maven process. Or maybe you have JRE 7 installed but not JDK 7 and > it's somehow still finding the Java 6 javac. > > On Tue, Aug 25, 2015 at 3:45 AM, Eric Friedman > wrote: > > I'm trying to build Spark 1.4 with Java 7 and despite having tha

Re: build spark 1.4.1 with JDK 1.6

2015-08-24 Thread Eric Friedman
I'm trying to build Spark 1.4 with Java 7 and despite having that as my JAVA_HOME, I get [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-launcher_2.10 --- [INFO] Using zinc server for incremental compilation [info] Compiling 8 Java sources to /Users/eric/spark/spark/lau

projection optimization?

2015-07-28 Thread Eric Friedman
If I have a Hive table with six columns and create a DataFrame (Spark 1.4.1) using a sqlContext.sql("select * from ...") query, the resulting physical plan shown by explain reflects the goal of returning all six columns. If I then call select("one_column") on that first DataFrame, the resulting Da
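
A quick way to check what this thread is asking about, with a hypothetical table name: compare the physical plans before and after the narrower select and see whether the scan is pruned to the single column:

    val df = sqlContext.sql("select * from my_table")   // my_table is hypothetical
    df.explain()                        // physical plan for all six columns
    df.select("one_column").explain()   // does the scan now request only this column?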

KMeans questions

2015-07-01 Thread Eric Friedman
In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train method, is there a cleaner way to create the Vectors than this? data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6))} Second, once I train the model and call predict on my
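
One arguably cleaner variant, sketched with hypothetical column names against the same `data` DataFrame: select the feature columns first so the positions are contiguous, then convert each Row in a single pass:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // select only the feature columns, then turn each Row into a dense Vector
    val features = data.select("c0", "c3", "c4", "c5", "c6")
      .map(r => Vectors.dense((0 until r.length).map(r.getDouble).toArray))
    val model = KMeans.train(features, 5, 20)    // k and maxIterations are arbitrary examples
    val clusterIds = model.predict(features)     // RDD[Int] of cluster assignments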

SPARK-8566

2015-06-23 Thread Eric Friedman
I logged this Jira this morning: https://issues.apache.org/jira/browse/SPARK-8566 I'm curious if any of the cognoscenti can advise as to a likely cause of the problem?

Re: join two DataFrames, same column name

2015-03-23 Thread Eric Friedman
ot;t2") > df1.join(df2, df1("column_id") === df2("column_id")).select("t1.column_id") > > Finally, there is SPARK-6380 > <https://issues.apache.org/jira/browse/SPARK-6380> that hopes to simplify > this particular case. > > Michael > >

Re: join two DataFrames, same column name

2015-03-23 Thread Eric Friedman
1") > var df2 = sqlContext.sql("select * from table2).as("t2") > df1.join(df2, df1("column_id") === df2("column_id")).select("t1.column_id") > > Finally, there is SPARK-6380 > <https://issues.apache.org/jira/browse/SPARK-6380> th

join two DataFrames, same column name

2015-03-21 Thread Eric Friedman
I have a couple of data frames that I pulled from SparkSQL and the primary key of one is a foreign key of the same name in the other. I'd rather not have to specify each column in the SELECT statement just so that I can rename this single column. When I try to join the data frames, I get an excep
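
Reconstructed from the replies in this thread (a sketch for the Spark 1.3 DataFrame API, with hypothetical table names): alias each side, join on the shared key, then select the column through its alias:

    val df1 = sqlContext.sql("select * from table1").as("t1")
    val df2 = sqlContext.sql("select * from table2").as("t2")
    val joined = df1.join(df2, df1("column_id") === df2("column_id"))
    joined.select("t1.column_id").show()   // the alias disambiguates the duplicated column name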

ShuffleBlockFetcherIterator: Failed to get block(s)

2015-03-20 Thread Eric Friedman
My job crashes with a bunch of these messages in the YARN logs. What are the appropriate steps in troubleshooting? 15/03/19 23:29:45 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 10 outstanding blocks (after 3 retries) 15/03/19 23:29:45 ERROR storage.ShuffleBlockFetcherI

Re: 1.3 release

2015-03-18 Thread Eric Friedman
gt; > I don't recall seeing that particular error before. It indicates to me > that the SparkContext is null. Is this maybe a knock-on error from the > SparkContext not initializing? I can see it would then cause this to > fail to init. > > On Tue, Mar 17, 2015 at 7:16 PM, Eric

Re: 1.3 release

2015-03-17 Thread Eric Friedman
Owen wrote: > OK, did you build with YARN support (-Pyarn)? and the right > incantation of flags like "-Phadoop-2.4 > -Dhadoop.version=2.5.0-cdh5.3.2" or similar? > > On Tue, Mar 17, 2015 at 2:39 PM, Eric Friedman > wrote: > > I did not find that the generic bui

Re: 1.3 release

2015-03-17 Thread Eric Friedman
e stock Hadoop build doesn't work > on MapR? that would actually surprise me, but then, why are these two > builds distributed? > > > On Sun, Mar 15, 2015 at 6:22 AM, Eric Friedman > wrote: > > Is there a reason why the

1.3 release

2015-03-14 Thread Eric Friedman
Is there a reason why the prebuilt releases don't include current CDH distros and YARN support? Eric Friedman

Re: python: module pyspark.daemon not found

2014-12-30 Thread Eric Friedman
, 2014 at 1:27 PM, Eric Friedman > wrote: >> Was your spark assembly jarred with Java 7? There's a known issue with jar >> files made with that version. It prevents them from being used on >> PYTHONPATH. You can rejar with Java 6 for better results. >> >> ---

Re: SparkContext with error from PySpark

2014-12-30 Thread Eric Friedman
The Python installed in your cluster is 2.5. You need at least 2.6. Eric Friedman > On Dec 30, 2014, at 7:45 AM, Jaggu wrote: > > Hi Team, > > I was trying to execute a Pyspark code in cluster. It gives me the following > error. (When I run the same job in local it i

Re: python: module pyspark.daemon not found

2014-12-29 Thread Eric Friedman
Was your spark assembly jarred with Java 7? There's a known issue with jar files made with that version. It prevents them from being used on PYTHONPATH. You can rejar with Java 6 for better results. Eric Friedman > On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala > wrote:

Re: action progress in ipython notebook?

2014-12-29 Thread Eric Friedman
UI inside-out but not to anyone else. Thank you very much. Eric > > Matei > > >> The lower latency potential of Sparrow is also very intriguing. >> >> Getting GraphX for PySpark would be very welcome. >> >> It's easy to find fault, of co

Re: action progress in ipython notebook?

2014-12-28 Thread Eric Friedman
parrow is also very intriguing. Getting GraphX for PySpark would be very welcome. It's easy to find fault, of course. I do want to say again how grateful I am to have a usable release in 1.2 and look forward to 1.3 and beyond with real excitement. Eric Friedman > On Dec 28, 2014,

Re: action progress in ipython notebook?

2014-12-28 Thread Eric Friedman
n, as well as expanding the > types of information exposed through the stable status API interface. > > - Josh > >> On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman >> wrote: >> Spark 1.2.0 is SO much more usable than previous releases -- many thanks to >> t

action progress in ipython notebook?

2014-12-25 Thread Eric Friedman
Spark 1.2.0 is SO much more usable than previous releases -- many thanks to the team for this release. A question about progress of actions. I can see how things are progressing using the Spark UI. I can also see the nice ASCII art animation on the spark driver console. Has anyone come up with
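
One thing a notebook could poll, sketched here rather than taken from the thread: the status API added in Spark 1.2 exposes the same per-stage progress that the web UI and the console animation use:

    val tracker = sc.statusTracker
    for (jobId <- tracker.getActiveJobIds;        // jobs currently running
         jobInfo <- tracker.getJobInfo(jobId);
         stageId <- jobInfo.stageIds;
         stageInfo <- tracker.getStageInfo(stageId)) {
      println(s"stage $stageId: ${stageInfo.numCompletedTasks}/${stageInfo.numTasks} tasks done")
    }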

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Eric Friedman
+1 Eric Friedman > On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung wrote: > > Are there a large number of non-deterministic lineage operators? > > This seems like a pretty big caveat, particularly for casual programmers who > expect consistent semantics between Spark a

Re: schema for schema

2014-09-18 Thread Eric Friedman
> > On Thu, Sep 18, 2014 at 8:49 AM, Eric Friedman < > eric.d.fried...@gmail.com> > > wrote: > >> > >> I have a SchemaRDD which I've gotten from a parquetFile. > >> > >> Did some transforms on it and now want to save it back out a

schema for schema

2014-09-18 Thread Eric Friedman
I have a SchemaRDD which I've gotten from a parquetFile. Did some transforms on it and now want to save it back out as parquet again. Getting a SchemaRDD proves challenging because some of my fields can be null/None and SQLContext.inferSchema rejects those. So, I decided to use the schema on the
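
A minimal sketch of the approach described here (Spark 1.1-era SchemaRDD API, placeholder paths and a stand-in transform): reuse the schema of the input instead of re-inferring it:

    val original = sqlContext.parquetFile("/path/in")          // placeholder path
    val schema = original.schema                               // keep the existing StructType
    val transformed = original.filter(r => !r.isNullAt(0))     // stand-in for the real transforms
    val out = sqlContext.applySchema(transformed, schema)
    out.saveAsParquetFile("/path/out")                         // placeholder path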

Re: pyspark on yarn - lost executor

2014-09-17 Thread Eric Friedman
How many partitions do you have in your input rdd? Are you specifying numPartitions in subsequent calls to groupByKey/reduceByKey? > On Sep 17, 2014, at 4:38 AM, Oleg Ruchovets wrote: > > Hi , > I am execution pyspark on yarn. > I have successfully executed initial dataset but now I growe
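
For reference, a minimal sketch (Scala shown; the pyspark calls accept the same argument) of passing an explicit partition count to the shuffle operations this reply asks about:

    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, 1))
    val counts = pairs.reduceByKey(_ + _, 400)   // 400 shuffle partitions; an arbitrary example value
    val groups = pairs.groupByKey(400)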

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
le files at a time. > > I'm also not sure if this is something to do with pyspark, since the > underlying Scala API takes a Configuration object rather than > dictionary. > > On Mon, Sep 15, 2014 at 11:23 PM, Eric Friedman > wrote: > > That would be awesome, but doesn

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
educe.input.fileinputformat.split.maxsize instead, which appears > to be the intended replacement in the new APIs. > > On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman > wrote: > > sc.textFile takes a minimum # of partitions to use. > > > > is there a way to get sc.newAP

minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
sc.textFile takes a minimum # of partitions to use. is there a way to get sc.newAPIHadoopFile to do the same? I know I can repartition() and get a shuffle. I'm wondering if there's a way to tell the underlying InputFormat (AvroParquet, in my case) how many partitions to use at the outset. What
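
A sketch of the direction the replies in this thread point to, assuming (not confirmed here) that the input format honors split.maxsize; TextInputFormat stands in for the AvroParquet format from the original question:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val hconf = new Configuration(sc.hadoopConfiguration)
    // a smaller maximum split size means more input splits, hence more partitions
    hconf.set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)
    val rdd = sc.newAPIHadoopFile("/path/to/data",               // placeholder path
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hconf)
    println(rdd.partitions.length)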

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Eric Friedman
eate > RDD by > newAPIHadoopFile(), then union them together. > > On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman > wrote: > > I neglected to specify that I'm using pyspark. Doesn't look like these > APIs have been bridged. > > > > > > Eric Friedma

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Eric Friedman
I neglected to specify that I'm using pyspark. Doesn't look like these APIs have been bridged. Eric Friedman > On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan wrote: > > Hi Eric, > > Something along the lines of the following should work > > val fs =

PathFilter for newAPIHadoopFile?

2014-09-14 Thread Eric Friedman
Hi, I have a directory structure with parquet+avro data in it. There are a couple of administrative files (.foo and/or _foo) that I need to ignore when processing this data or Spark tries to read them as containing parquet content, which they do not. How can I set a PathFilter on the FileInputFor
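
A sketch of the list-and-union approach suggested in the replies to this thread, with a placeholder directory and sc.textFile standing in for the real newAPIHadoopFile call:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val goodPaths = fs.listStatus(new Path("/data/parquet"))     // placeholder directory
      .map(_.getPath)
      .filter(p => !p.getName.startsWith(".") && !p.getName.startsWith("_"))
    // build one RDD per surviving path, then union them into a single RDD
    val rdd = goodPaths.map(p => sc.textFile(p.toString)).reduce(_ union _)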

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-23 Thread Eric Friedman
Yes. And point that variable at your virtual env python. Eric Friedman > On Aug 22, 2014, at 6:08 AM, Earthson wrote: > > Do I have to deploy Python to every machine to make "$PYSPARK_PYTHON" work > correctly? > > > > -- > View this message in cont

Re: Running Spark shell on YARN

2014-08-16 Thread Eric Friedman
+1 for such a document. Eric Friedman > On Aug 15, 2014, at 1:10 PM, Kevin Markey wrote: > > Sandy and others: > > Is there a single source of Yarn/Hadoop properties that should be set or > reset for running Spark on Yarn? > We've sort of stumbled through

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
On Sun, Aug 10, 2014 at 2:43 PM, Michael Armbrust wrote: > >> > if I try to add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH, in >> order to get DeprecatedParquetInputFormat, I find out that there is an >> incompatibility in the SerDeUtils class. Spark's Hive snapshot expects to >> find >

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
Thanks Michael, I can try that too. I know you guys aren't in sales/marketing (thank G-d), but given all the hoopla about the CDH<->DataBricks partnership, it'd be awesome if you guys were somewhat more aligned, by which I mean that the DataBricks releases on Apache that say "for CDH5" would a

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
d apparently about CDH but I think the > issue is actually more broadly applicable.) > > > On Sun, Aug 10, 2014 at 9:20 PM, Eric Friedman > wrote: >> Hi Sean, >> >> Thanks for the reply. I'm on CDH 5.0.3 and upgrading the whole cluster to >> 5.1.0 wi

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
e? > > Because that's what the Spark that was shipped with CDH would have > done. Are you replacing / not using that? > > > > > > On Sun, Aug 10, 2014 at 5:36 PM, Eric Friedman > wrote: > > I have a CDH5.0.3 cluster with Hive tables written in Parquet. > &g

CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
I have a CDH5.0.3 cluster with Hive tables written in Parquet. The tables have the "DeprecatedParquetInputFormat" on their metadata, and when I try to select from one using Spark SQL, it blows up with a stack trace like this: java.lang.RuntimeException: java.lang.ClassNotFoundException: parquet.h

Re: pyspark script fails on EMR with an ERROR in configuring object.

2014-08-03 Thread Eric Friedman
I am close to giving up on PySpark on YARN. It simply doesn't work for straightforward operations and it's quite difficult to understand why. I would love to be proven wrong, by the way. Eric Friedman > On Aug 3, 2014, at 7:03 AM, Rahul Bhojwani > wrote: > > Th

Re: rdd.saveAsTextFile blows up

2014-07-24 Thread Eric Friedman
d the FileSystem object in our > code. > > Thanks > Best Regards > > >> On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman >> wrote: >> I'm trying to run a simple pipeline using PySpark, version 1.0.1 >> >> I've created an RDD over

rdd.saveAsTextFile blows up

2014-07-24 Thread Eric Friedman
I'm trying to run a simple pipeline using PySpark, version 1.0.1 I've created an RDD over a parquetFile and am mapping the contents with a transformer function and now wish to write the data out to HDFS. All of the executors fail with the same stack trace (below) I do get a directory on HDFS, bu

GraphX for pyspark?

2014-07-24 Thread Eric Friedman
I understand that GraphX is not yet available for pyspark. I was wondering if the Spark team has set a target release and timeframe for doing that work? Thank you, Eric

Re: Lost executors

2014-07-23 Thread Eric Friedman
. On Wed, Jul 23, 2014 at 8:40 PM, Eric Friedman wrote: > hi Andrew, > > Thanks for your note. Yes, I see a stack trace now. It seems to be an > issue with python interpreting a function I wish to apply to an RDD. The > stack trace is below. The function is a simple factori

Re: Lost executors

2014-07-23 Thread Eric Friedman
> some exception, and the message you see is just a side effect. > > Andrew > > > 2014-07-23 8:27 GMT-07:00 Eric Friedman : > > I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc. >> Cluster resources are available to me via Yarn and I

Lost executors

2014-07-23 Thread Eric Friedman
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc. Cluster resources are available to me via Yarn and I am seeing these errors quite often. ERROR YarnClientClusterScheduler: Lost executor 63 on : remote Akka client disassociated This is in an interactive shell session. I

Re: SparkSQL operator priority

2014-07-19 Thread Eric Friedman
Can position be null? Looks like there may be constraints with predicate push down in that case. https://github.com/apache/spark/pull/511/ > On Jul 18, 2014, at 8:04 PM, Christos Kozanitis > wrote: > > Hello > > What is the order with which SparkSQL deserializes parquet fields? Is it > poss

replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Eric Friedman
I used to use SPARK_LIBRARY_PATH to specify the location of native libs for lzo compression when using spark 0.9.0. The references to that environment variable have disappeared from the docs for spark 1.0.1 and it's not clear how to specify the location for lzo. Any guidance?
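
Not from the thread, but a sketch of where this moved in Spark 1.x: the library path became a pair of configuration properties (placeholder path shown), settable on the SparkConf or in spark-defaults.conf:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("lzo-example")
      .set("spark.executor.extraLibraryPath", "/opt/hadoop/lib/native")   // placeholder path
      .set("spark.driver.extraLibraryPath", "/opt/hadoop/lib/native")
    val sc = new SparkContext(conf)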

Re: specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
tml > > -Sandy > > > On Mon, May 19, 2014 at 8:08 AM, Eric Friedman wrote: > Hi > > I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how > to get the spark repo to use more than 2 nodes in an interactive session. > > So, this works, but

specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
Hi I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the spark repl to use more than 2 nodes in an interactive session. So, this works, but is non-interactive (using yarn-client as MASTER) /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-cla