Re: createDirectStream with offsets

2016-05-07 Thread Eric Friedman
carefully at the error message, the types you're passing in don't match. For instance, you're passing in a message handler that returns a tuple, but the rdd return type you're specifying (the 5th type argument) is just String. On Fri, May 6, 2016 at 9:49 AM, Eric Friedman
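
To illustrate the type alignment described above, here is a minimal sketch of the offset-taking createDirectStream overload (spark-streaming-kafka 0.8 API). The topic name, offsets, and kafkaParams are placeholders rather than the poster's actual code; the point is that a handler returning a (key, value) tuple requires (String, String) as the fifth type argument.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    def buildStream(ssc: StreamingContext, kafkaParams: Map[String, String]) = {
      // placeholder topic/offsets -- supply the real starting positions here
      val fromOffsets = Map(TopicAndPartition("some_topic", 0) -> 0L)

      // the handler returns a tuple, so R (the 5th type argument) is (String, String)
      val messageHandler =
        (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

      KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
        ssc, kafkaParams, fromOffsets, messageHandler)
    }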

Re: createDirectStream with offsets

2016-05-06 Thread Eric Friedman
'com.yammer.metrics:metrics-core:2.2.0' On Fri, May 6, 2016 at 7:47 AM, Eric Friedman <eric.d.fried...@gmail.com> wrote: Hello, I've been using createDirectStream with Kafka and now need to switch to the version of that API that lets me supply offsets for my topics. I'm unable t
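
The coordinate quoted above is in Gradle/Maven form; assuming an sbt build (the thread doesn't say which build tool is in use), the equivalent dependency line would be:

    libraryDependencies += "com.yammer.metrics" % "metrics-core" % "2.2.0"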

createDirectStream with offsets

2016-05-06 Thread Eric Friedman
Hello, I've been using createDirectStream with Kafka and now need to switch to the version of that API that lets me supply offsets for my topics. I'm unable to get this to compile for some reason, even if I lift the very same usage from the Spark test suite. I'm calling it like this: val

Hadoop Context

2016-04-28 Thread Eric Friedman
Hello, Where in the Spark APIs can I get access to the Hadoop Context instance? I am trying to implement the Spark equivalent of this public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { if (record == null) {
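
Spark's RDD API has no Hadoop Context object, so the usual translation of a reducer body is groupByKey followed by mapValues, with output coming from the returned RDD rather than context.write. A rough, hypothetical sketch follows; since the reducer above is only partially quoted, the value types and the merge logic are stand-ins.

    import org.apache.spark.rdd.RDD

    def reduceEquivalent(pairs: RDD[(String, String)]): RDD[(String, String)] =
      pairs
        .groupByKey()                 // one (key, values) group per former reduce() call
        .mapValues { values =>
          // body of the old reduce() goes here; emit results by returning them
          // instead of calling context.write(...)
          values.mkString(",")
        }

Counters and job configuration, the other things a Context typically supplies, map onto accumulators and SparkConf/broadcast variables respectively.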

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Eric Friedman
installed but not JDK 7 and it's somehow still finding the Java 6 javac. On Tue, Aug 25, 2015 at 3:45 AM, Eric Friedman eric.d.fried...@gmail.com wrote: I'm trying to build Spark 1.4 with Java 7 and despite having that as my JAVA_HOME, I get [INFO] --- scala-maven-plugin:3.2.2:compile (scala

Re: build spark 1.4.1 with JDK 1.6

2015-08-24 Thread Eric Friedman
I'm trying to build Spark 1.4 with Java 7 and despite having that as my JAVA_HOME, I get [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-launcher_2.10 --- [INFO] Using zinc server for incremental compilation [info] Compiling 8 Java sources to

projection optimization?

2015-07-28 Thread Eric Friedman
If I have a Hive table with six columns and create a DataFrame (Spark 1.4.1) using a sqlContext.sql("select * from ...") query, the resulting physical plan shown by explain reflects the goal of returning all six columns. If I then call select("one_column") on that first DataFrame, the resulting
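
A quick way to see what the optimizer actually does here is to compare the two physical plans. This is only a sketch of the experiment being described; the table and column names are placeholders, and whether the narrower projection is pushed down to the underlying scan is exactly the open question.

    val all = sqlContext.sql("select * from some_table")  // plan scans all six columns
    val one = all.select("one_column")                    // narrower projection on top

    all.explain(true)
    one.explain(true)  // check whether the scan underneath now reads only one_column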

KMeans questions

2015-07-01 Thread Eric Friedman
In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train method, is there a cleaner way to create the Vectors than this? data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6))} Second, once I train the model and call predict on my
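
One possible cleanup of that map, assuming data is the DataFrame in question, is to drive the getDouble calls from a list of column indices instead of spelling each one out:

    import org.apache.spark.mllib.linalg.Vectors

    // the same five columns as above, listed once
    val cols = Seq(0, 3, 4, 5, 6)
    val vectors = data.map(r => Vectors.dense(cols.map(r.getDouble).toArray))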

SPARK-8566

2015-06-23 Thread Eric Friedman
I logged this Jira this morning: https://issues.apache.org/jira/browse/SPARK-8566 I'm curious if any of the cognoscenti can advise as to a likely cause of the problem?

Re: join two DataFrames, same column name

2015-03-23 Thread Eric Friedman
, there is SPARK-6380 (https://issues.apache.org/jira/browse/SPARK-6380) that hopes to simplify this particular case. Michael On Sat, Mar 21, 2015 at 3:02 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I have a couple of data frames that I pulled from SparkSQL and the primary key of one is a foreign

Re: join two DataFrames, same column name

2015-03-23 Thread Eric Friedman
) df1.join(df2, df1("column_id") === df2("column_id")).select("t1.column_id") Finally, there is SPARK-6380 (https://issues.apache.org/jira/browse/SPARK-6380) that hopes to simplify this particular case. Michael On Sat, Mar 21, 2015 at 3:02 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I

join two DataFrames, same column name

2015-03-21 Thread Eric Friedman
I have a couple of data frames that I pulled from SparkSQL and the primary key of one is a foreign key of the same name in the other. I'd rather not have to specify each column in the SELECT statement just so that I can rename this single column. When I try to join the data frames, I get an
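
A hedged sketch of the alias-based workaround in the Spark 1.3-era API; df1, df2, and the column names are placeholders for the frames and shared key in the message:

    import sqlContext.implicits._   // for the $"..." column syntax

    val t1 = df1.as("t1")
    val t2 = df2.as("t2")

    val joined = t1.join(t2, $"t1.column_id" === $"t2.column_id")
    joined.select($"t1.column_id", $"t2.other_column")   // references are now unambiguous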

ShuffleBlockFetcherIterator: Failed to get block(s)

2015-03-20 Thread Eric Friedman
My job crashes with a bunch of these messages in the YARN logs. What are the appropriate steps in troubleshooting? 15/03/19 23:29:45 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 10 outstanding blocks (after 3 retries) 15/03/19 23:29:45 ERROR

Re: 1.3 release

2015-03-18 Thread Eric Friedman
seeing that particular error before. It indicates to me that the SparkContext is null. Is this maybe a knock-on error from the SparkContext not initializing? I can see it would then cause this to fail to init. On Tue, Mar 17, 2015 at 7:16 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Yes

Re: 1.3 release

2015-03-17 Thread Eric Friedman
, 2015 at 7:43 AM, Sean Owen so...@cloudera.com wrote: OK, did you build with YARN support (-Pyarn)? and the right incantation of flags like -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 or similar? On Tue, Mar 17, 2015 at 2:39 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I did not find

Re: 1.3 release

2015-03-17 Thread Eric Friedman
surprise me, but then, why are these two builds distributed? On Sun, Mar 15, 2015 at 6:22 AM, Eric Friedman eric.d.fried...@gmail.com wrote: Is there a reason why the prebuilt releases don't include current CDH distros and YARN support? Eric Friedman

1.3 release

2015-03-15 Thread Eric Friedman
Is there a reason why the prebuilt releases don't include current CDH distros and YARN support? Eric Friedman

Re: SparkContext with error from PySpark

2014-12-30 Thread Eric Friedman
The Python installed in your cluster is 2.5. You need at least 2.6. Eric Friedman On Dec 30, 2014, at 7:45 AM, Jaggu jagana...@gmail.com wrote: Hi Team, I was trying to execute a Pyspark code in cluster. It gives me the following error. (When I run the same job in local

Re: python: module pyspark.daemon not found

2014-12-30 Thread Eric Friedman
:27 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Was your spark assembly jarred with Java 7? There's a known issue with jar files made with that version. It prevents them from being used on PYTHONPATH. You can rejar with Java 6 for better results. Eric Friedman On Dec 29, 2014

Re: action progress in ipython notebook?

2014-12-29 Thread Eric Friedman
how grateful I am to have a usable release in 1.2 and look forward to 1.3 and beyond with real excitement. Eric Friedman On Dec 28, 2014, at 5:40 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Eric, I'm just curious - which specific features in 1.2 do you find most help

Re: python: module pyspark.daemon not found

2014-12-29 Thread Eric Friedman
Was your spark assembly jarred with Java 7? There's a known issue with jar files made with that version. It prevents them from being used on PYTHONPATH. You can rejar with Java 6 for better results. Eric Friedman On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala npok...@spcapitaliq.com

Re: action progress in ipython notebook?

2014-12-28 Thread Eric Friedman
intriguing. Getting GraphX for PySpark would be very welcome. It's easy to find fault, of course. I do want to say again how grateful I am to have a usable release in 1.2 and look forward to 1.3 and beyond with real excitement. Eric Friedman On Dec 28, 2014, at 5:40 PM, Patrick Wendell

action progress in ipython notebook?

2014-12-25 Thread Eric Friedman
Spark 1.2.0 is SO much more usable than previous releases -- many thanks to the team for this release. A question about progress of actions. I can see how things are progressing using the Spark UI. I can also see the nice ASCII art animation on the spark driver console. Has anyone come up with

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Eric Friedman
+1 Eric Friedman On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Are there a large number of non-deterministic lineage operators? This seems like a pretty big caveat, particularly for casual programmers who expect consistent semantics between Spark

schema for schema

2014-09-18 Thread Eric Friedman
I have a SchemaRDD which I've gotten from a parquetFile. Did some transforms on it and now want to save it back out as parquet again. Getting a SchemaRDD proves challenging because some of my fields can be null/None and SQLContext.inferSchema rejects those. So, I decided to use the schema on
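
A minimal sketch of that approach with the Spark 1.1-era API, assuming the transform maps Row to Row; the paths and the transform itself are placeholders, not the actual pipeline.

    import org.apache.spark.sql.Row

    def transform(r: Row): Row = r   // placeholder for the real transformation

    val original    = sqlContext.parquetFile("/path/in")
    val transformed = original.map(transform)                           // back to RDD[Row]
    val withSchema  = sqlContext.applySchema(transformed, original.schema)
    withSchema.saveAsParquetFile("/path/out")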

Re: schema for schema

2014-09-18 Thread Eric Friedman
are investigating. On Thu, Sep 18, 2014 at 8:49 AM, Eric Friedman eric.d.fried...@gmail.com wrote: I have a SchemaRDD which I've gotten from a parquetFile. Did some transforms on it and now want to save it back out as parquet again. Getting a SchemaRDD proves challenging because some of my

Re: pyspark on yarn - lost executor

2014-09-17 Thread Eric Friedman
How many partitions do you have in your input rdd? Are you specifying numPartitions in subsequent calls to groupByKey/reduceByKey? On Sep 17, 2014, at 4:38 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, I am executing pyspark on yarn. I have successfully executed initial dataset

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Eric Friedman
, and create an RDD with newAPIHadoopFile(), then union them together. On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman eric.d.fried...@gmail.com wrote: I neglected to specify that I'm using pyspark. Doesn't look like these APIs have been bridged. Eric Friedman On Sep 14, 2014, at 11:02 PM

minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
sc.textFile takes a minimum # of partitions to use. Is there a way to get sc.newAPIHadoopFile to do the same? I know I can repartition() and get a shuffle. I'm wondering if there's a way to tell the underlying InputFormat (AvroParquet, in my case) how many partitions to use at the outset. What
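
newAPIHadoopFile has no minPartitions argument, but the split size can be capped in the Hadoop Configuration it receives, which is one way to get more partitions up front. The following is a sketch under that assumption: TextInputFormat stands in for the AvroParquet format in this thread, the path and split size are made up, and whether a given InputFormat honors the setting depends on the format.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("mapreduce.input.fileinputformat.split.maxsize",
             (64 * 1024 * 1024).toString)      // smaller splits => more input partitions

    val rdd = sc.newAPIHadoopFile(
      "/path/to/input",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf)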

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
, which appears to be the intended replacement in the new APIs. On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman eric.d.fried...@gmail.com wrote: sc.textFile takes a minimum # of partitions to use. Is there a way to get sc.newAPIHadoopFile to do the same? I know I can repartition() and get

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
. I'm also not sure if this is something to do with pyspark, since the underlying Scala API takes a Configuration object rather than dictionary. On Mon, Sep 15, 2014 at 11:23 PM, Eric Friedman eric.d.fried...@gmail.com wrote: That would be awesome, but doesn't seem to have any effect

PathFilter for newAPIHadoopFile?

2014-09-14 Thread Eric Friedman
Hi, I have a directory structure with parquet+avro data in it. There are a couple of administrative files (.foo and/or _foo) that I need to ignore when processing this data or Spark tries to read them as containing parquet content, which they do not. How can I set a PathFilter on the
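
One way to do this from the Scala side, sketched under the assumption that the data is read through newAPIHadoopFile, is to register a PathFilter class via the mapreduce.input.pathFilter.class key that the new-API FileInputFormat consults. (As the follow-up above notes, this isn't bridged to pyspark.) The filter class name below is made up.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{Path, PathFilter}

    // skip the administrative .foo / _foo files
    class DataFilesOnly extends PathFilter {
      override def accept(p: Path): Boolean = {
        val name = p.getName
        !name.startsWith(".") && !name.startsWith("_")
      }
    }

    val conf = new Configuration(sc.hadoopConfiguration)
    conf.setClass("mapreduce.input.pathFilter.class",
                  classOf[DataFilesOnly], classOf[PathFilter])
    // then pass `conf` as the final argument to sc.newAPIHadoopFile(...)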

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-23 Thread Eric Friedman
Yes. And point that variable at your virtual env python. Eric Friedman On Aug 22, 2014, at 6:08 AM, Earthson earthson...@gmail.com wrote: Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work correctly?

Re: Running Spark shell on YARN

2014-08-16 Thread Eric Friedman
+1 for such a document. Eric Friedman On Aug 15, 2014, at 1:10 PM, Kevin Markey kevin.mar...@oracle.com wrote: Sandy and others: Is there a single source of Yarn/Hadoop properties that should be set or reset for running Spark on Yarn? We've sort of stumbled through one property

CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
I have a CDH5.0.3 cluster with Hive tables written in Parquet. The tables have the DeprecatedParquetInputFormat on their metadata, and when I try to select from one using Spark SQL, it blows up with a stack trace like this: java.lang.RuntimeException: java.lang.ClassNotFoundException:

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
done. Are you replacing / not using that? On Sun, Aug 10, 2014 at 5:36 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I have a CDH5.0.3 cluster with Hive tables written in Parquet. The tables have the DeprecatedParquetInputFormat on their metadata, and when I try to select from one

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
:20 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Hi Sean, Thanks for the reply. I'm on CDH 5.0.3 and upgrading the whole cluster to 5.1.0 will eventually happen but not immediately. I've tried running the CDH spark-1.0 release and also building it from source. This, unfortunately

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
Thanks Michael, I can try that too. I know you guys aren't in sales/marketing (thank G-d), but given all the hoopla about the CDH-DataBricks partnership, it'd be awesome if you guys were somewhat more aligned, by which I mean that the DataBricks releases on Apache that say for CDH5 would

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
On Sun, Aug 10, 2014 at 2:43 PM, Michael Armbrust mich...@databricks.com wrote: if I try to add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH, in order to get DeprecatedParquetInputFormat, I find out that there is an incompatibility in the SerDeUtils class. Spark's Hive snapshot

Re: rdd.saveAsTextFile blows up

2014-07-25 Thread Eric Friedman
Best Regards On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I'm trying to run a simple pipeline using PySpark, version 1.0.1. I've created an RDD over a parquetFile and am mapping the contents with a transformer function and now wish to write the data

GraphX for pyspark?

2014-07-24 Thread Eric Friedman
I understand that GraphX is not yet available for pyspark. I was wondering if the Spark team has set a target release and timeframe for doing that work? Thank you, Eric

rdd.saveAsTextFile blows up

2014-07-24 Thread Eric Friedman
I'm trying to run a simple pipeline using PySpark, version 1.0.1. I've created an RDD over a parquetFile and am mapping the contents with a transformer function and now wish to write the data out to HDFS. All of the executors fail with the same stack trace (below). I do get a directory on HDFS,

Lost executors

2014-07-23 Thread Eric Friedman
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc. Cluster resources are available to me via Yarn and I am seeing these errors quite often. ERROR YarnClientClusterScheduler: Lost executor 63 on host: remote Akka client disassociated This is in an interactive shell

Re: Lost executors

2014-07-23 Thread Eric Friedman
, and the message you see is just a side effect. Andrew 2014-07-23 8:27 GMT-07:00 Eric Friedman eric.d.fried...@gmail.com: I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc. Cluster resources are available to me via Yarn and I am seeing these errors quite often. ERROR

Re: Lost executors

2014-07-23 Thread Eric Friedman
. On Wed, Jul 23, 2014 at 8:40 PM, Eric Friedman eric.d.fried...@gmail.com wrote: hi Andrew, Thanks for your note. Yes, I see a stack trace now. It seems to be an issue with python interpreting a function I wish to apply to an RDD. The stack trace is below. The function is a simple

Re: SparkSQL operator priority

2014-07-19 Thread Eric Friedman
Can position be null? Looks like there may be constraints with predicate push down in that case. https://github.com/apache/spark/pull/511/ On Jul 18, 2014, at 8:04 PM, Christos Kozanitis kozani...@berkeley.edu wrote: Hello What is the order with which SparkSQL deserializes parquet

replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Eric Friedman
I used to use SPARK_LIBRARY_PATH to specify the location of native libs for lzo compression when using spark 0.9.0. The references to that environment variable have disappeared from the docs for spark 1.0.1 and it's not clear how to specify the location for lzo. Any guidance?
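
In Spark 1.x the native library path is set through configuration properties rather than SPARK_LIBRARY_PATH. A hedged sketch follows; the path is a placeholder for wherever the lzo native libs actually live.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("lzo-example")
      .set("spark.executor.extraLibraryPath", "/opt/hadoop/lib/native")
    val sc = new SparkContext(conf)

    // For the driver JVM itself, spark.driver.extraLibraryPath has to be set
    // before launch, e.g. in spark-defaults.conf or via spark-submit's
    // --driver-library-path flag.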

specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
Hi, I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the spark repl to use more than 2 nodes in an interactive session. So, this works, but is non-interactive (using yarn-client as MASTER)

Re: specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
-Sandy On Mon, May 19, 2014 at 8:08 AM, Eric Friedman e...@spottedsnake.net wrote: Hi, I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the spark repl to use more than 2 nodes in an interactive session. So, this works, but is non-interactive (using