the types you're passing in don't
> match. For instance, you're passing in a message handler that returns
> a tuple, but the RDD return type you're specifying (the 5th type
> argument) is just String.
>
> On Fri, May 6, 2016 at 9:49 AM, Eric Friedman
>
.1'
compile 'org.apache.kafka:kafka_2.10:0.8.2.1'
compile 'com.yammer.metrics:metrics-core:2.2.0'
On Fri, May 6, 2016 at 7:47 AM, Eric Friedman
wrote:
> Hello,
>
> I've been using createDirectStream with Kafka and now need to switch to
> the version of tha
Hello,
I've been using createDirectStream with Kafka and now need to switch to the
version of that API that lets me supply offsets for my topics. I'm unable
to get this to compile for some reason, even if I lift the very same usage
from the Spark test suite.
I'm calling it like this:
val to
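A minimal sketch of the fromOffsets variant of that call (not the poster's actual code), assuming Spark 1.6-era spark-streaming-kafka (Kafka 0.8 API), String keys/values, an existing StreamingContext named ssc, and a hypothetical topic and broker:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker
val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)       // hypothetical topic/offset

// The 5th type argument must match what the message handler returns -- here (String, String).
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))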
Hello,
Where in the Spark APIs can I get access to the Hadoop Context instance? I
am trying to implement the Spark equivalent of this
public void reduce(Text key, Iterable values, Context context)
    throws IOException, InterruptedException {
  if (record == null) {
    throw
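There is no Hadoop Context object in the RDD API; a minimal sketch of how the same reduce shape is commonly written in Spark, assuming a pair RDD (the input and record names below are hypothetical, and whatever the old code emitted via context.write becomes the returned value):

// records stands in for the reducer's (key, values) input as an RDD[(String, String)]
val records = sc.parallelize(Seq(("k1", "a"), ("k1", "b"), ("k2", "c")))
val reduced = records.groupByKey().map { case (key, values) =>
  if (values.isEmpty) throw new IllegalStateException(s"no values for key $key")
  (key, values.mkString(","))   // plays the role of context.write(key, result)
}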
not visible to the
> Maven process. Or maybe you have JRE 7 installed but not JDK 7 and
> it's somehow still finding the Java 6 javac.
>
> On Tue, Aug 25, 2015 at 3:45 AM, Eric Friedman
> wrote:
> > I'm trying to build Spark 1.4 with Java 7 and despite having tha
I'm trying to build Spark 1.4 with Java 7 and despite having that as my
JAVA_HOME, I get
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
spark-launcher_2.10 ---
[INFO] Using zinc server for incremental compilation
[info] Compiling 8 Java sources to
/Users/eric/spark/spark/lau
If I have a Hive table with six columns and create a DataFrame (Spark
1.4.1) using a sqlContext.sql("select * from ...") query, the resulting
physical plan shown by explain reflects the goal of returning all six
columns.
If I then call select("one_column") on that first DataFrame, the resulting
Da
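A quick way to compare the two plans, assuming a hypothetical table name and a sqlContext backed by Hive:

val df = sqlContext.sql("select * from my_six_column_table")   // hypothetical table
df.explain()                          // physical plan scans all six columns
df.select("one_column").explain()     // check whether the scan is pruned to the single column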
In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train
method, is there a cleaner way to create the Vectors than this?
data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4),
r.getDouble(5), r.getDouble(6))}
Second, once I train the model and call predict on my
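On the first point, one slightly tidier way to build the vectors, assuming the same column positions as above:

import org.apache.spark.mllib.linalg.Vectors

val cols = Seq(0, 3, 4, 5, 6)
val vectors = data.map(r => Vectors.dense(cols.map(r.getDouble).toArray))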
I logged this Jira this morning:
https://issues.apache.org/jira/browse/SPARK-8566
I'm curious if any of the cognoscenti can advise as to a likely cause of
the problem?
ot;t2")
> df1.join(df2, df1("column_id") === df2("column_id")).select("t1.column_id")
>
> Finally, there is SPARK-6380
> <https://issues.apache.org/jira/browse/SPARK-6380> that hopes to simplify
> this particular case.
>
> Michael
>
>
1")
> var df2 = sqlContext.sql("select * from table2").as("t2")
> df1.join(df2, df1("column_id") === df2("column_id")).select("t1.column_id")
>
> Finally, there is SPARK-6380
> <https://issues.apache.org/jira/browse/SPARK-6380> th
I have a couple of data frames that I pulled from SparkSQL and the primary
key of one is a foreign key of the same name in the other. I'd rather not
have to specify each column in the SELECT statement just so that I can
rename this single column.
When I try to join the data frames, I get an excep
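A self-contained version of the aliasing approach quoted in the reply above, with hypothetical table and column names:

val df1 = sqlContext.sql("select * from table1").as("t1")
val df2 = sqlContext.sql("select * from table2").as("t2")
val joined = df1.join(df2, df1("column_id") === df2("column_id"))
                .select("t1.column_id")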
My job crashes with a bunch of these messages in the YARN logs.
What are the appropriate steps in troubleshooting?
15/03/19 23:29:45 ERROR shuffle.RetryingBlockFetcher: Exception while
beginning fetch of 10 outstanding blocks (after 3 retries)
15/03/19 23:29:45 ERROR storage.ShuffleBlockFetcherI
>
> I don't recall seeing that particular error before. It indicates to me
> that the SparkContext is null. Is this maybe a knock-on error from the
> SparkContext not initializing? I can see it would then cause this to
> fail to init.
>
> On Tue, Mar 17, 2015 at 7:16 PM, Eric
Owen wrote:
> OK, did you build with YARN support (-Pyarn)? and the right
> incantation of flags like "-Phadoop-2.4
> -Dhadoop.version=2.5.0-cdh5.3.2" or similar?
>
> On Tue, Mar 17, 2015 at 2:39 PM, Eric Friedman
> wrote:
> > I did not find that the generic bui
e stock Hadoop build doesn't work
> on MapR? that would actually surprise me, but then, why are these two
> builds distributed?
>
>
> On Sun, Mar 15, 2015 at 6:22 AM, Eric Friedman
> wrote:
> > Is there a reason why the
Is there a reason why the prebuilt releases don't include current CDH distros
and YARN support?
Eric Friedman
, 2014 at 1:27 PM, Eric Friedman
> wrote:
>> Was your spark assembly jarred with Java 7? There's a known issue with jar
>> files made with that version. It prevents them from being used on
>> PYTHONPATH. You can rejar with Java 6 for better results.
>>
>> ---
The Python installed in your cluster is 2.5. You need at least 2.6.
Eric Friedman
> On Dec 30, 2014, at 7:45 AM, Jaggu wrote:
>
> Hi Team,
>
> I was trying to execute a Pyspark code in cluster. It gives me the following
> error. (When I run the same job in local it i
Was your spark assembly jarred with Java 7? There's a known issue with jar
files made with that version. It prevents them from being used on PYTHONPATH.
You can rejar with Java 6 for better results.
Eric Friedman
> On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala
> wrote:
UI inside-out but not to anyone else.
Thank you very much.
Eric
>
> Matei
>
>
>> The lower latency potential of Sparrow is also very intriguing.
>>
>> Getting GraphX for PySpark would be very welcome.
>>
>> It's easy to find fault, of co
parrow is
also very intriguing.
Getting GraphX for PySpark would be very welcome.
It's easy to find fault, of course. I do want to say again how grateful I am to
have a usable release in 1.2 and look forward to 1.3 and beyond with real
excitement.
Eric Friedman
> On Dec 28, 2014,
n, as well as expanding the
> types of information exposed through the stable status API interface.
>
> - Josh
>
>> On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman
>> wrote:
>> Spark 1.2.0 is SO much more usable than previous releases -- many thanks to
>> t
Spark 1.2.0 is SO much more usable than previous releases -- many thanks to
the team for this release.
A question about progress of actions. I can see how things are progressing
using the Spark UI. I can also see the nice ASCII art animation on the
spark driver console.
Has anyone come up with
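If the question is about tracking progress programmatically rather than through the UI, one option is the status API that 1.2 introduced; a minimal sketch, polled from a separate thread while an action runs:

val tracker = sc.statusTracker
for (stageId <- tracker.getActiveStageIds; info <- tracker.getStageInfo(stageId)) {
  println(s"stage $stageId: ${info.numCompletedTasks}/${info.numTasks} tasks complete")
}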
+1
Eric Friedman
> On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung wrote:
>
> Are there a large number of non-deterministic lineage operators?
>
> This seems like a pretty big caveat, particularly for casual programmers who
> expect consistent semantics between Spark a
>
> > On Thu, Sep 18, 2014 at 8:49 AM, Eric Friedman <
> eric.d.fried...@gmail.com>
> > wrote:
> >>
> >> I have a SchemaRDD which I've gotten from a parquetFile.
> >>
> >> Did some transforms on it and now want to save it back out a
I have a SchemaRDD which I've gotten from a parquetFile.
Did some transforms on it and now want to save it back out as parquet again.
Getting a SchemaRDD proves challenging because some of my fields can be
null/None and SQLContext.inferSchema rejects those.
So, I decided to use the schema on the
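In Scala terms, a minimal sketch of that approach: reuse the schema of the original SchemaRDD via applySchema instead of re-inferring it (this assumes your version has SchemaRDD.schema and SQLContext.applySchema; paths and the transform body are hypothetical):

val original = sqlContext.parquetFile("/data/in.parquet")     // hypothetical path
// placeholder transform: the real one must keep the original column layout for the schema to still apply
val transformed = original.map(r => r)
val withSchema = sqlContext.applySchema(transformed, original.schema)
withSchema.saveAsParquetFile("/data/out.parquet")             // hypothetical path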
How many partitions do you have in your input rdd? Are you specifying
numPartitions in subsequent calls to groupByKey/reduceByKey?
> On Sep 17, 2014, at 4:38 AM, Oleg Ruchovets wrote:
>
> Hi ,
> I am execution pyspark on yarn.
> I have successfully executed initial dataset but now I growe
le files at a time.
>
> I'm also not sure if this is something to do with pyspark, since the
> underlying Scala API takes a Configuration object rather than
> dictionary.
>
> On Mon, Sep 15, 2014 at 11:23 PM, Eric Friedman
> wrote:
> > That would be awesome, but doesn
educe.input.fileinputformat.split.maxsize instead, which appears
> to be the intended replacement in the new APIs.
>
> On Mon, Sep 15, 2014 at 9:35 PM, Eric Friedman
> wrote:
> > sc.textFile takes a minimum # of partitions to use.
> >
> > is there a way to get sc.newAP
sc.textFile takes a minimum # of partitions to use.
is there a way to get sc.newAPIHadoopFile to do the same?
I know I can repartition() and get a shuffle. I'm wondering if there's a
way to tell the underlying InputFormat (AvroParquet, in my case) how many
partitions to use at the outset.
What
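One workaround along the lines of the reply above is to cap the split size so the new-API InputFormat produces more splits; a sketch with TextInputFormat standing in for AvroParquet and a hypothetical 64 MB cap:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)
val rdd = sc.newAPIHadoopFile("/data/in",                      // hypothetical path
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])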
eate
> RDD by
> newAPIHadoopFile(), then union them together.
>
> On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
> wrote:
> > I neglected to specify that I'm using pyspark. Doesn't look like these
> APIs have been bridged.
> >
> >
> > Eric Friedma
I neglected to specify that I'm using pyspark. Doesn't look like these APIs
have been bridged.
Eric Friedman
> On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan wrote:
>
> Hi Eric,
>
> Something along the lines of the following should work
>
> val fs =
Hi,
I have a directory structure with parquet+avro data in it. There are a
couple of administrative files (.foo and/or _foo) that I need to ignore
when processing this data or Spark tries to read them as containing parquet
content, which they do not.
How can I set a PathFilter on the FileInputFor
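One approach that may work is to register a PathFilter through the Hadoop configuration that the new-API FileInputFormat consults; a sketch (the property name below is my assumption for the new API, so verify it against your Hadoop version):

import org.apache.hadoop.fs.{Path, PathFilter}

// skip administrative files such as .foo and _foo
class VisibleFilesOnly extends PathFilter {
  override def accept(p: Path): Boolean = {
    val name = p.getName
    !name.startsWith(".") && !name.startsWith("_")
  }
}

sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
  classOf[VisibleFilesOnly], classOf[PathFilter])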
Yes. And point that variable at your virtual env python.
Eric Friedman
> On Aug 22, 2014, at 6:08 AM, Earthson wrote:
>
> Do I have to deploy Python to every machine to make "$PYSPARK_PYTHON" work
> correctly?
>
>
>
+1 for such a document.
Eric Friedman
> On Aug 15, 2014, at 1:10 PM, Kevin Markey wrote:
>
> Sandy and others:
>
> Is there a single source of Yarn/Hadoop properties that should be set or
> reset for running Spark on Yarn?
> We've sort of stumbled through
On Sun, Aug 10, 2014 at 2:43 PM, Michael Armbrust
wrote:
>
>>
> if I try to add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH, in
>> order to get DeprecatedParquetInputFormat, I find out that there is an
>> incompatibility in the SerDeUtils class. Spark's Hive snapshot expects to
>> find
>
Thanks Michael, I can try that too.
I know you guys aren't in sales/marketing (thank G-d), but given all the hoopla
about the CDH<->DataBricks partnership, it'd be awesome if you guys were
somewhat more aligned, by which I mean that the DataBricks releases on Apache
that say "for CDH5" would a
d apparently about CDH but I think the
> issue is actually more broadly applicable.)
>
>
> On Sun, Aug 10, 2014 at 9:20 PM, Eric Friedman
> wrote:
>> Hi Sean,
>>
>> Thanks for the reply. I'm on CDH 5.0.3 and upgrading the whole cluster to
>> 5.1.0 wi
e?
>
> Because that's what the Spark that was shipped with CDH would have
> done. Are you replacing / not using that?
>
>
>
>
>
> On Sun, Aug 10, 2014 at 5:36 PM, Eric Friedman
> wrote:
> > I have a CDH5.0.3 cluster with Hive tables written in Parquet.
> >
I have a CDH5.0.3 cluster with Hive tables written in Parquet.
The tables have the "DeprecatedParquetInputFormat" on their metadata, and
when I try to select from one using Spark SQL, it blows up with a stack
trace like this:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
parquet.h
I am close to giving up on PySpark on YARN. It simply doesn't work for
straightforward operations and it's quite difficult to understand why.
I would love to be proven wrong, by the way.
Eric Friedman
> On Aug 3, 2014, at 7:03 AM, Rahul Bhojwani
> wrote:
>
> Th
d the FileSystem object in our
> code.
>
> Thanks
> Best Regards
>
>
>> On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman
>> wrote:
>> I'm trying to run a simple pipeline using PySpark, version 1.0.1
>>
>> I've created an RDD over
I'm trying to run a simple pipeline using PySpark, version 1.0.1
I've created an RDD over a parquetFile and am mapping the contents with a
transformer function and now wish to write the data out to HDFS.
All of the executors fail with the same stack trace (below)
I do get a directory on HDFS, bu
I understand that GraphX is not yet available for pyspark. I was wondering
if the Spark team has set a target release and timeframe for doing that
work?
Thank you,
Eric
.
On Wed, Jul 23, 2014 at 8:40 PM, Eric Friedman
wrote:
> hi Andrew,
>
> Thanks for your note. Yes, I see a stack trace now. It seems to be an
> issue with python interpreting a function I wish to apply to an RDD. The
> stack trace is below. The function is a simple factori
> some exception, and the message you see is just a side effect.
>
> Andrew
>
>
> 2014-07-23 8:27 GMT-07:00 Eric Friedman :
>
> I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
>> Cluster resources are available to me via Yarn and I
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
Cluster resources are available to me via Yarn and I am seeing these
errors quite often.
ERROR YarnClientClusterScheduler: Lost executor 63 on : remote Akka
client disassociated
This is in an interactive shell session. I
Can position be null? Looks like there may be constraints with predicate push
down in that case. https://github.com/apache/spark/pull/511/
> On Jul 18, 2014, at 8:04 PM, Christos Kozanitis
> wrote:
>
> Hello
>
> What is the order with which SparkSQL deserializes parquet fields? Is it
> poss
I used to use SPARK_LIBRARY_PATH to specify the location of native libs
for lzo compression when using spark 0.9.0.
The references to that environment variable have disappeared from the docs
for
spark 1.0.1 and it's not clear how to specify the location for lzo.
Any guidance?
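One thing worth trying, assuming 1.x honors the extra-library-path settings the way the old env var did (the native-lib directory below is hypothetical, and exact property names can differ by version):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.extraLibraryPath", "/opt/hadoopgpl/native/Linux-amd64-64")
  .set("spark.executor.extraLibraryPath", "/opt/hadoopgpl/native/Linux-amd64-64")

The same two keys can also go in spark-defaults.conf so they apply to spark-submit and the shell.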
tml
>
> -Sandy
>
>
> On Mon, May 19, 2014 at 8:08 AM, Eric Friedman wrote:
> Hi
>
> I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how
> to get the spark repo to use more than 2 nodes in an interactive session.
>
> So, this works, but
Hi
I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to
get the spark repo to use more than 2 nodes in an interactive session.
So, this works, but is non-interactive (using yarn-client as MASTER)
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/bin/spark-cla