> Look carefully at the error message: the types you're passing in don't
> match. For instance, you're passing in a message handler that returns
> a tuple, but the RDD return type you're specifying (the 5th type
> argument) is just String.
>
> On Fri, May 6, 2016 at 9:49 AM, Eric Friedman
'com.yammer.metrics:metrics-core:2.2.0'
On Fri, May 6, 2016 at 7:47 AM, Eric Friedman <eric.d.fried...@gmail.com>
wrote:
> Hello,
>
> I've been using createDirectStream with Kafka and now need to switch to
> the version of that API that lets me supply offsets for my topics. I'm
> unable to get this to compile for some reason.
Hello,
I've been using createDirectStream with Kafka and now need to switch to the
version of that API that lets me supply offsets for my topics. I'm unable
to get this to compile for some reason, even if I lift the very same usage
from the Spark test suite.
I'm calling it like this:
val
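For reference, a minimal sketch of the offset-taking overload (Spark 1.x
KafkaUtils; ssc and kafkaParams are assumed to already exist, and the topic is
illustrative). Note that the fifth type argument must match what the
messageHandler returns:
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
// start topic "mytopic", partition 0, at offset 0
val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 0L)
// the handler returns a tuple, so the fifth type argument is (String, String)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder,
    StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))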
Hello,
Where in the Spark APIs can I get access to the Hadoop Context instance? I
am trying to implement the Spark equivalent of this
public void reduce(Text key, Iterable values, Context context)
    throws IOException, InterruptedException {
if (record == null) {
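Spark exposes no Hadoop Context object; the usual equivalent of a Hadoop
reduce(key, values, context) is a pair-RDD transformation, with accumulators
standing in for Context counters. A rough sketch, assuming pairs is an
existing pair RDD and sc a live SparkContext:
// count null values the way a Context counter would
val nullRecords = sc.accumulator(0L, "null records")
val reduced = pairs.groupByKey().map { case (key, values) =>
  values.foreach(v => if (v == null) nullRecords += 1L)
  (key, values.size)
}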
installed but not JDK 7 and
it's somehow still finding the Java 6 javac.
On Tue, Aug 25, 2015 at 3:45 AM, Eric Friedman
eric.d.fried...@gmail.com wrote:
I'm trying to build Spark 1.4 with Java 7 and despite having that as my
JAVA_HOME, I get
[INFO] --- scala-maven-plugin:3.2.2:compile (scala
I'm trying to build Spark 1.4 with Java 7 and despite having that as my
JAVA_HOME, I get
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
spark-launcher_2.10 ---
[INFO] Using zinc server for incremental compilation
[info] Compiling 8 Java sources to
If I have a Hive table with six columns and create a DataFrame (Spark
1.4.1) using a sqlContext.sql("select * from ...") query, the resulting
physical plan shown by explain reflects the goal of returning all six
columns.
If I then call select(one_column) on that first DataFrame, the resulting
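For anyone reproducing the pruning question, a minimal sketch (table and
column names are illustrative):
val df = sqlContext.sql("select * from some_table") // plan scans all six columns
val one = df.select("one_column")
one.explain(true) // check whether the scan has been pruned to one column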
In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train
method, is there a cleaner way to create the Vectors than this?
data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4),
r.getDouble(5), r.getDouble(6))}
Second, once I train the model and call predict on my
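One modestly cleaner variant is to map over the column indices once (a
sketch; the indices are the ones from the snippet above):
import org.apache.spark.mllib.linalg.Vectors
val columns = Seq(0, 3, 4, 5, 6)
val vectors = data.map(r => Vectors.dense(columns.map(r.getDouble).toArray))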
I logged this Jira this morning:
https://issues.apache.org/jira/browse/SPARK-8566
I'm curious if any of the cognoscenti can advise as to a likely cause of
the problem?
Finally, there is SPARK-6380
https://issues.apache.org/jira/browse/SPARK-6380 that hopes to simplify
this particular case.
Michael
On Sat, Mar 21, 2015 at 3:02 PM, Eric Friedman eric.d.fried...@gmail.com
wrote:
I have a couple of data frames that I pulled from SparkSQL and the
primary key of one is a foreign
df1.join(df2, df1("column_id") === df2("column_id")).select(df1("column_id"))
I have a couple of data frames that I pulled from SparkSQL and the primary
key of one is a foreign key of the same name in the other. I'd rather not
have to specify each column in the SELECT statement just so that I can
rename this single column.
When I try to join the data frames, I get an
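For reference, Spark 1.4 added a join variant that takes the shared column
name directly and keeps a single copy of the key, so no column enumeration
is needed (a sketch; the column name is illustrative):
val joined = df1.join(df2, "column_id")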
My job crashes with a bunch of these messages in the YARN logs.
What are the appropriate steps in troubleshooting?
15/03/19 23:29:45 ERROR shuffle.RetryingBlockFetcher: Exception while
beginning fetch of 10 outstanding blocks (after 3 retries)
15/03/19 23:29:45 ERROR
seeing that particular error before. It indicates to me
that the SparkContext is null. Is this maybe a knock-on error from the
SparkContext not initializing? I can see it would then cause this to
fail to init.
On Tue, Mar 17, 2015 at 7:16 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
Yes
, 2015 at 7:43 AM, Sean Owen so...@cloudera.com wrote:
OK, did you build with YARN support (-Pyarn)? and the right
incantation of flags like -Phadoop-2.4
-Dhadoop.version=2.5.0-cdh5.3.2 or similar?
On Tue, Mar 17, 2015 at 2:39 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
I did not find
surprise me, but then, why are these two
builds distributed?
On Sun, Mar 15, 2015 at 6:22 AM, Eric Friedman
eric.d.fried...@gmail.com wrote:
Is there a reason why the prebuilt releases don't include current CDH
distros and YARN support?
Eric Friedman
The Python installed in your cluster is 2.5. You need at least 2.6.
Eric Friedman
On Dec 30, 2014, at 7:45 AM, Jaggu jagana...@gmail.com wrote:
Hi Team,
I was trying to execute a Pyspark code in cluster. It gives me the following
error. (When I run the same job in local
how grateful I am
to have a usable release in 1.2 and look forward to 1.3 and beyond with real
excitement.
Eric Friedman
On Dec 28, 2014, at 5:40 PM, Patrick Wendell pwend...@gmail.com wrote:
Hey Eric,
I'm just curious - which specific features in 1.2 do you find most
helpful
Was your spark assembly jarred with Java 7? There's a known issue with jar
files made with that version. It prevents them from being used on PYTHONPATH.
You can rejar with Java 6 for better results.
Eric Friedman
On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala npok...@spcapitaliq.com
intriguing.
Getting GraphX for PySpark would be very welcome.
It's easy to find fault, of course. I do want to say again how grateful I am to
have a usable release in 1.2 and look forward to 1.3 and beyond with real
excitement.
Eric Friedman
On Dec 28, 2014, at 5:40 PM, Patrick Wendell
Spark 1.2.0 is SO much more usable than previous releases -- many thanks to
the team for this release.
A question about progress of actions. I can see how things are progressing
using the Spark UI. I can also see the nice ASCII art animation on the
spark driver console.
Has anyone come up with
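For programmatic progress, the status API added in Spark 1.2 can be polled
from the driver. A rough sketch, assuming a live SparkContext sc:
val tracker = sc.statusTracker
tracker.getActiveStageIds.foreach { stageId =>
  tracker.getStageInfo(stageId).foreach { s =>
    println(s"stage $stageId: ${s.numCompletedTasks}/${s.numTasks} tasks done")
  }
}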
+1
Eric Friedman
On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
Are there a large number of non-deterministic lineage operators?
This seems like a pretty big caveat, particularly for casual programmers who
expect consistent semantics between Spark
I have a SchemaRDD which I've gotten from a parquetFile.
Did some transforms on it and now want to save it back out as parquet again.
Getting a SchemaRDD proves challenging because some of my fields can be
null/None and SQLContext.inferSchema rejects those.
So, I decided to use the schema on
are investigating.
On Thu, Sep 18, 2014 at 8:49 AM, Eric Friedman
eric.d.fried...@gmail.com
wrote:
I have a SchemaRDD which I've gotten from a parquetFile.
Did some transforms on it and now want to save it back out as parquet
again.
Getting a SchemaRDD proves challenging because some of my
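One workaround sketch for the nullable-field problem above: reuse the schema
already attached to the source SchemaRDD instead of re-inferring one (Spark
1.x API; the paths and the transform function are assumed):
val source = sqlContext.parquetFile("/path/in")
val transformed = source.map(transformRow) // transformRow: Row => Row
val out = sqlContext.applySchema(transformed, source.schema)
out.saveAsParquetFile("/path/out")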
How many partitions do you have in your input rdd? Are you specifying
numPartitions in subsequent calls to groupByKey/reduceByKey?
On Sep 17, 2014, at 4:38 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi ,
I am execution pyspark on yarn.
I have successfully executed initial dataset
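To illustrate the numPartitions suggestion above (Scala shown; the count is
illustrative, and pyspark's reduceByKey accepts the same optional argument):
val counts = pairs.reduceByKey(_ + _, 512) // 512 reducers instead of the default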
, and create
RDD by
newAPIHadoopFile(), then union them together.
On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
eric.d.fried...@gmail.com wrote:
I neglected to specify that I'm using pyspark. Doesn't look like these
APIs have been bridged.
Eric Friedman
On Sep 14, 2014, at 11:02 PM
sc.textFile takes a minimum # of partitions to use.
is there a way to get sc.newAPIHadoopFile to do the same?
I know I can repartition() and get a shuffle. I'm wondering if there's a
way to tell the underlying InputFormat (AvroParquet, in my case) how many
partitions to use at the outset.
What
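One lever worth noting, sketched with TextInputFormat for concreteness:
FileInputFormat-based formats compute their splits from the Hadoop
configuration, so capping the maximum split size yields more partitions at
the outset (the size is illustrative):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration(sc.hadoopConfiguration)
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024)
val rdd = sc.newAPIHadoopFile("/data/path", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], conf)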
, which appears
to be the intended replacement in the new APIs.
On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
sc.textFile takes a minimum # of partitions to use.
is there a way to get sc.newAPIHadoopFile to do the same?
I know I can repartition() and get
I'm also not sure if this is something to do with pyspark, since the
underlying Scala API takes a Configuration object rather than
a dictionary.
On Mon, Sep 15, 2014 at 11:23 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
That would be awesome, but doesn't seem to have any effect
Hi,
I have a directory structure with parquet+avro data in it. There are a
couple of administrative files (.foo and/or _foo) that I need to ignore
when processing this data or Spark tries to read them as containing parquet
content, which they do not.
How can I set a PathFilter on the
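A sketch of one way to do it with the new-API input path: FileInputFormat
honors a custom PathFilter named in the Hadoop configuration (the filter
class must also be visible on the executors):
import org.apache.hadoop.fs.{Path, PathFilter}
class SkipAdminFiles extends PathFilter {
  override def accept(p: Path): Boolean =
    !(p.getName.startsWith(".") || p.getName.startsWith("_"))
}
sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
  classOf[SkipAdminFiles], classOf[PathFilter])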
Yes. And point that variable at your virtual env python.
Eric Friedman
On Aug 22, 2014, at 6:08 AM, Earthson earthson...@gmail.com wrote:
Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work
correctly?
+1 for such a document.
Eric Friedman
On Aug 15, 2014, at 1:10 PM, Kevin Markey kevin.mar...@oracle.com wrote:
Sandy and others:
Is there a single source of Yarn/Hadoop properties that should be set or
reset for running Spark on Yarn?
We've sort of stumbled through one property
I have a CDH5.0.3 cluster with Hive tables written in Parquet.
The tables have the DeprecatedParquetInputFormat on their metadata, and
when I try to select from one using Spark SQL, it blows up with a stack
trace like this:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
done. Are you replacing / not using that?
On Sun, Aug 10, 2014 at 5:36 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
I have a CDH5.0.3 cluster with Hive tables written in Parquet.
The tables have the DeprecatedParquetInputFormat on their metadata, and
when I try to select from one
:20 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
Hi Sean,
Thanks for the reply. I'm on CDH 5.0.3 and upgrading the whole cluster to
5.1.0 will eventually happen but not immediately.
I've tried running the CDH spark-1.0 release and also building it from
source. This, unfortunately
Thanks Michael, I can try that too.
I know you guys aren't in sales/marketing (thank G-d), but given all the hoopla
about the CDH-DataBricks partnership, it'd be awesome if you guys were
somewhat more aligned, by which I mean that the DataBricks releases on Apache
that say for CDH5 would
On Sun, Aug 10, 2014 at 2:43 PM, Michael Armbrust mich...@databricks.com
wrote:
if I try to add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH, in
order to get DeprecatedParquetInputFormat, I find out that there is an
incompatibility in the SerDeUtils class. Spark's Hive snapshot
Best Regards
On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman eric.d.fried...@gmail.com
wrote:
I'm trying to run a simple pipeline using PySpark, version 1.0.1
I've created an RDD over a parquetFile and am mapping the contents with a
transformer function and now wish to write the data
I understand that GraphX is not yet available for pyspark. I was wondering
if the Spark team has set a target release and timeframe for doing that
work?
Thank you,
Eric
I'm trying to run a simple pipeline using PySpark, version 1.0.1
I've created an RDD over a parquetFile and am mapping the contents with a
transformer function and now wish to write the data out to HDFS.
All of the executors fail with the same stack trace (below)
I do get a directory on HDFS,
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
Cluster resources are available to me via Yarn and I am seeing these
errors quite often.
ERROR YarnClientClusterScheduler: Lost executor 63 on host: remote Akka
client disassociated
This is in an interactive shell
, and the message you see is just a side effect.
Andrew
2014-07-23 8:27 GMT-07:00 Eric Friedman eric.d.fried...@gmail.com:
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
Cluster resources are available to me via Yarn and I am seeing these
errors quite often.
ERROR
On Wed, Jul 23, 2014 at 8:40 PM, Eric Friedman eric.d.fried...@gmail.com
wrote:
hi Andrew,
Thanks for your note. Yes, I see a stack trace now. It seems to be an
issue with python interpreting a function I wish to apply to an RDD. The
stack trace is below. The function is a simple
Can position be null? Looks like there may be constraints with predicate push
down in that case. https://github.com/apache/spark/pull/511/
On Jul 18, 2014, at 8:04 PM, Christos Kozanitis kozani...@berkeley.edu
wrote:
Hello
What is the order with which SparkSQL deserializes parquet
I used to use SPARK_LIBRARY_PATH to specify the location of native libs
for lzo compression when using spark 0.9.0.
The references to that environment variable have disappeared from the docs
for
spark 1.0.1 and it's not clear how to specify the location for lzo.
Any guidance?
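A sketch of the Spark 1.x replacement keys (the path is illustrative; the
same settings can live in spark-defaults.conf):
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraLibraryPath", "/opt/hadoop/lib/native")
  .set("spark.driver.extraLibraryPath", "/opt/hadoop/lib/native")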
Hi
I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to
get the Spark REPL to use more than 2 nodes in an interactive session.
So, this works, but is non-interactive (using yarn-client as MASTER)
-Sandy
On Mon, May 19, 2014 at 8:08 AM, Eric Friedman e...@spottedsnake.net wrote:
Hi
I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how
to get the Spark REPL to use more than 2 nodes in an interactive session.
So, this works, but is non-interactive (using
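For reference, on YARN the executor count has to be requested explicitly; a
sketch of the conf-based form (the count is illustrative, and spark-shell of
that era accepted the equivalent --num-executors flag):
val conf = new org.apache.spark.SparkConf()
  .setMaster("yarn-client")
  .set("spark.executor.instances", "20")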