Cool! Thanks for your input, Jacek and Mark!
From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: 13 January 2017 12:59
To: Phadnis, Varun
Cc: user@spark.apache.org
Subject: Re: Spark and Kafka integration
See "API compatibility" in
A better way to answer my question: use GenericRow instead of Row:
val rows: RDD[Row] = spark.sparkContext.textFile("/sourcedata/test/test1").map {
  line => {
    val attributes: Array[String] = line.split(",")
    val ab = ArrayBuffer[Any]()
    for (i
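For reference, a hedged sketch of where that truncated snippet appears to be headed, based on the similar code later in this digest; schemaType and the convert helper are assumptions, not part of the original mail:
```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRow

// Sketch only: schemaType is assumed to be the fields of the target
// StructType, and convert() is a hypothetical string-to-DataType helper.
val rows: RDD[Row] = spark.sparkContext.textFile("/sourcedata/test/test1").map { line =>
  val attributes: Array[String] = line.split(",")
  val ab = ArrayBuffer[Any]()
  for (i <- 0 until schemaType.length) {
    ab += convert(attributes(i), schemaType(i).dataType)
  }
  new GenericRow(ab.toArray): Row
}
```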
I faced a similar issue and had to do two things (see the sketch below):
1. Submit the Kryo jar with spark-submit
2. Set spark.executor.userClassPathFirst=true in the Spark conf
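For illustration, a hedged sketch of what that submission could look like; the jar names, paths, and main class are placeholders:
```
spark-submit \
  --jars /path/to/kryo-3.0.3.jar \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.Main \
  /path/to/app.jar
```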
On Fri, Nov 18, 2016 at 7:39 PM, chrism wrote:
> Regardless of the different ways we have tried deploying a jar
Initially there is no directory; the directory is created by the Spark job, and it should be empty during job execution. It seems df.write itself creates the first file and then tries to overwrite it.
On Fri, Jan 13, 2017 at 11:42 AM, Amrit Jangid wrote:
> Hi Rajendra,
>
> It says your
Hi Rajendra,
It says your directory is not empty: s3n://buccketName/cip/daily_date.
Try to use a save mode, e.g.
df.write.mode(SaveMode.Overwrite).partitionBy("date").format("com.databricks.spark.csv").option("delimiter", "#").option("codec", "
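A hedged completion of that suggestion, for reference; the codec value and the destination path are assumptions:
```scala
import org.apache.spark.sql.SaveMode

// Sketch: overwrite the non-empty partitioned output instead of failing.
// The codec and path below are placeholders, not from the original thread.
df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("date")
  .format("com.databricks.spark.csv")
  .option("delimiter", "#")
  .option("codec", "gzip")
  .save("s3n://buccketName/cip/daily_date")
```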
Hi team,
I am reading N CSV files and writing the output partitioned by date. date is one column; it has integer values (e.g. 20170101).
val df = spark.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .option("delimiter","#")
  .option("nullValue","")
Yes, I save it to S3 in a different process. It is actually the
RandomForestClassificationModel.load method (passed an s3 path) where I run
into problems.
When you say you load it during map stages, do you mean that you are able
to directly load a model from inside of a transformation? When I try
Thank you Nicholas. If the source data were in CSV format, the CSV reader would work well.
2017-01-13
lk_spark
From: Nicholas Hakobian
Sent: 2017-01-13 08:35
Subject: Re: Re: Re: how to change datatype by useing StructType
To: "lk_spark"
Cc: "ayan
Hi
Given that training and prediction are two different applications, I typically
save model objects to HDFS and load them back during the prediction map stages.
Best
Ayan
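For reference, a minimal sketch of that save/load pattern under the Spark 2.0.2 ML API; the DataFrames and the HDFS path are placeholders:
```scala
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

// Training application: fit and persist the model.
val model = new RandomForestClassifier().fit(trainingDF)
model.save("hdfs:///models/rf-model")

// Prediction application: load the model back and score new data.
val loaded = RandomForestClassificationModel.load("hdfs:///models/rf-model")
val predictions = loaded.transform(scoringDF)
```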
On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh wrote:
> Hi all,
> I've been working with Spark mllib 2.0.2
Hi Everyone,
Is there any suggestion for a DBSCAN Scala implementation?
My application code is running on Spark 2.0, but any suggestion is fine.
--
Regards ,
Shobhit G
Hi All -
I'm having an issue with detecting a failed Spark application state when
using the startApplication method and SparkAppHandle with the SparkLauncher
in Spark 2.0.1.
Previously I had used a Java Process, calling waitFor to get a non-zero exit
code to detect failure, which worked. But when
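For context, a hedged sketch of the SparkAppHandle approach being described; the Spark home, jar, and class names are placeholders:
```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Sketch: watch application state through a listener instead of a
// Process exit code. All paths and names below are placeholders.
val handle = new SparkLauncher()
  .setSparkHome("/opt/spark")
  .setAppResource("/path/to/app.jar")
  .setMainClass("com.example.Main")
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit = {
      if (h.getState == SparkAppHandle.State.FAILED) {
        // react to the failed application here
      }
    }
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })
```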
Have you tried the native CSV reader (in Spark 2) or the Databricks CSV
reader (in 1.6)?
If your format is CSV-like, it'll load directly into a DataFrame. It's
possible you have some rows where types are inconsistent.
Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally
If the executor reports a different hostname inside the CNI container, then
no, I don't think so.
On Thu, Jan 12, 2017 at 2:28 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:
> So even if I make the Spark executors run on the same node as Cassandra
> nodes, I am not sure each
So even if I make the Spark executors run on the same nodes as the Cassandra
nodes, I am not sure each worker will connect to the C* nodes on the same
Mesos agent?
2017-01-12 21:13 GMT+01:00 Michael Gummelt :
> The code in there w/ docs that reference CNI doesn't actually run
Hi all,
We need to use the rand() function in Scala Spark SQL in our
application, but we discovered that its behavior was not deterministic; that
is, different invocations with the same seed would result in different
values. This is documented in some bugs, for example:
The code in there w/ docs that reference CNI doesn't actually run when CNI
is in effect, and doesn't have anything to do with locality. It's just
making Spark work in a no-DNS environment.
On Thu, Jan 12, 2017 at 12:04 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:
> I have
I have found this but I am not sure how it can help...
https://github.com/mesosphere/spark-build/blob/a9efef8850976f787956660262f3b77cd636f3f5/conf/spark-env.sh
2017-01-12 20:16 GMT+01:00 Michael Gummelt :
> That's a good point. I hadn't considered the locality
See "API compatibility" in http://spark.apache.org/versioning-policy.html
While code that is annotated as Experimental is still a good faith effort
to provide a stable and useful API, the fact is that we're not yet
confident enough that we've got the public API in exactly the form that we
want to
That's a good point. I hadn't considered the locality implications of CNI
yet. I think tasks are placed based on the hostname reported by the
executor, which in a CNI container will be different than the
HDFS/Cassandra hostname. I'm not aware of anyone running Spark+CNI in prod
yet, either.
Hi all,
I've been working with Spark mllib 2.0.2 RandomForestClassificationModel.
I encountered two frustrating issues and would really appreciate some
advice:
1) RandomForestClassificationModel is effectively not serializable (I
assume it's referencing something that can't be serialized, since
Page 33 of the Sparkling Water Booklet:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/SparklingWaterBooklet.pdf
df = sqlContext.read.format("h2o").option("key",frame.frame_id).load()
df = sqlContext.read.format("h2o").load(frame.frame_id)
On Thu, Jan 12, 2017 at 1:17 PM, Md. Rezaul
Hi there,
Is there any way to convert an H2O DataFrame to an equivalent Spark RDD or
DataFrame? I found good documentation on "Machine Learning with
Sparkling Water: H2O + Spark" here at.
Nice it worked !!
thx
Jorge Machado
www.jmachado.me
> On 12 Jan 2017, at 17:46, Asher Krim wrote:
>
> Have you tried using an alias? You should be able to replace
> ("dbtable”,"sometable") with ("dbtable”,"SELECT utc_timestamp AS my_timestamp
> FROM sometable")
>
>
You can find the Spark version of spark-submit in the log. Could you check
whether it's consistent?
On Thu, Jan 12, 2017 at 7:35 AM Ramkumar Venkataraman <
ram.the.m...@gmail.com> wrote:
> Spark: 1.6.1
>
> I am trying to use the new mapWithState API and I am getting the following
> error:
>
>
You may have tested this code against the Spark version on your local machine,
which may be different from what's in Google Cloud Storage.
You need to select the appropriate Spark version when you submit your job.
On 12 January 2017 at 15:51, Anahita Talebi
wrote:
> Dear
Hi there,
When I do an RDD map with more than 22 columns, I get "error: too many
arguments for unapply pattern, maximum = 22".
scala> val rddRes=rows.map{case Row(col1,..col23) => Row(...)}
Is there a known way to get around this issue?
p.s. Here is a full traceback:
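One way around the 22-argument unapply limit, sketched under the assumption that rows is an RDD[Row]: index into the Row instead of pattern matching.
```scala
import org.apache.spark.sql.Row

// Sketch: Row.get, Row.toSeq and Row.fromSeq have no 22-field limit.
val rddRes = rows.map { row =>
  val col1 = row.get(0)   // access individual columns by position
  val all  = row.toSeq    // or take the whole row as a Seq[Any]
  Row.fromSeq(all)        // rebuild a Row without the unapply pattern
}
```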
From: Harjit Singh [mailto:harjit.si...@deciphernow.com]
Sent: Tuesday, April 26, 2016 3:11 PM
To: user@spark.apache.org
Subject: test
Have you tried using an alias? You should be able to replace
("dbtable”,"sometable")
with ("dbtable”,"SELECT utc_timestamp AS my_timestamp FROM sometable")
--
Asher Krim
Senior Software Engineer
On Thu, Jan 12, 2017 at 10:49 AM, Jorge Machado wrote:
> Hi Guys,
>
> I’m having a
I get the correct result occasionally. Try more often, or increase the
range number; then it will break down eventually.
I use Spark 2.0.0.2.5 in YARN client mode.
On 12.01.2017 15:13, "Takeshi Yamamuro" wrote:
Hi,
I got the correct answer. Did I miss something?
// maropu
Hi all,
Does anyone have experience running Spark on Mesos with CNI (IP per
container)?
How would Spark use the IP or hostname for data locality with a backend
framework like HDFS or Cassandra?
V
Dear all,
I am trying to run a .jar file as a job using Submit Job in the Google Cloud
console.
https://cloud.google.com/dataproc/docs/guides/submit-job
I actually ran the Spark code on my local computer to generate a .jar file.
Then in the Arguments field, I give the values of the arguments that I
Hi Guys,
I’m having an issue loading data with a JDBC connector.
My line of code is:
val df = sqlContext.read.format("jdbc").option("url","jdbc:mysql://localhost:3306
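For comparison, a hedged sketch of a complete JDBC read; the database name, credentials, and driver are placeholders, and the dbtable value uses the subquery alias suggested earlier in this thread:
```scala
// Sketch only: the option names are the standard Spark JDBC options;
// everything else below is a placeholder.
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "(SELECT utc_timestamp AS my_timestamp FROM sometable) AS t")
  .option("user", "user")
  .option("password", "password")
  .load()
```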
What version of Spark are you on?
Although it's cut off, I think your error is with RandomForestClassifier,
is that correct? If so, you should upgrade to Spark 2 since I think this
class only became writable/readable in Spark 2 (
https://github.com/apache/spark/pull/12118)
On Thu, Jan 12, 2017
Spark: 1.6.1
I am trying to use the new mapWithState API and I am getting the following
error:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/streaming/StateSpec$
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.streaming.StateSpec$
Build.sbt
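Since the build.sbt itself is cut off here, a hedged sketch of what the relevant lines typically look like for this setup; this is not the poster's actual file, and a common cause of this NoClassDefFoundError is a spark-streaming version or scope mismatch between the build and the cluster:
```scala
// Sketch of typical build.sbt entries for Spark Streaming 1.6.1; "provided"
// assumes the cluster supplies matching Spark jars at runtime.
scalaVersion := "2.10.6"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided"
)
```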
Hi All,
Is there any support for theta joins in Spark? We want to identify the
country based on ranges of IP addresses (we have the ranges in our DB).
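For what it's worth, Spark SQL accepts non-equi join conditions (executed as a broadcast nested loop join in this case), so a range join can be expressed directly; the table and column names below are assumptions:
```scala
// Sketch: join requests to a country lookup table on an IP range.
// ips(ip_num) and countries(start_ip, end_ip, country) are hypothetical
// DataFrames with IPs already converted to numeric form.
val withCountry = ips.join(
  countries,
  ips("ip_num") >= countries("start_ip") && ips("ip_num") <= countries("end_ip")
)
```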
Hi,
I have executed my spark job using spark-submit on my local machine and on
cluster.
Now I want to try using HDFS: put the data (a text file) on HDFS, read it
from there, execute the jar file, and finally write the output to HDFS.
I got this error after running the job:
failed to launch
Hi,
I got the correct answer. Did I miss something?
// maropu
---
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/
Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
Hello,
I am new to Spark.
I need to run a Spark job within Oozie.
Individually I am able to run the Spark job, but with Oozie, after the job is
launched I am getting the following error:
2017-01-12 13:51:57,696 INFO [main] org.apache.hadoop.service.AbstractService:
Service
Hi Phadnis,
I found this in
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html:
> This version of the integration is marked as experimental, so the API is
> potentially subject to change.
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering
Hello,
We are using Spark 2.0 with Kafka 0.10.
As I understand, much of the API packaged in the following dependency we are
targeting is marked as "@Experimental"
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
  <version>2.0.0</version>
</dependency>
What are the implications of this being marked as experimental?
Hi Malshan,
The error says that one (or more) of the estimators/stages is either not
writable or not compatible with the overwrite/model write operation.
Suppose you want to configure an ML pipeline consisting of three stages
(i.e. estimators): tokenizer, hashingTF, and nb:
val nb = new
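Since the code is cut off here, a hedged sketch of such a three-stage pipeline and its persistence, assuming the Spark 2.x ML API and placeholder data/paths:
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Sketch: all three stages implement MLWritable, so the pipeline can be saved.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val nb = new NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, nb))

// Fit and persist; trainingDF and the path are placeholders.
val model = pipeline.fit(trainingDF)
model.write.overwrite().save("/tmp/nb-pipeline-model")
```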
Just in case you are more comfortable with SQL,
row_number() over a window
should also generate a unique id.
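A minimal sketch of that idea with the DataFrame window API; row_number needs an ordering, so some_column below is a placeholder for any column that gives a stable order:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Sketch: add a unique, consecutive id column to an existing DataFrame df.
val w = Window.orderBy("some_column")
val withId = df.withColumn("id", row_number().over(w))
```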
On Thu, Jan 12, 2017 at 7:00 PM, akbar501 wrote:
> The following are 2 different approaches to adding an id/index to RDDs and
> 1
> approach to adding an index to a
I have tried it like this:
val peopleRDD = spark.sparkContext.textFile("/sourcedata/test/test*")
val rowRDD = peopleRDD.map(_.split(",")).map(attributes => {
  val ab = ArrayBuffer[Any]()
  for (i <- 0 until schemaType.length) {
    if
Hi,
When I try to save a pipeline model using Spark ML (Java), the following
exception is thrown.
java.lang.UnsupportedOperationException: Pipeline write will fail on this
Pipeline because it contains a stage which does not implement Writable.
Non-Writable stage: rfc_98f8c9e0bd04 of type class
The following are 2 different approaches to adding an id/index to RDDs and 1
approach to adding an index to a DataFrame.
Add an index column to an RDD
```scala
// RDD
val dataRDD = sc.textFile("./README.md")
// Add index then set index as key in map() transformation
// Results in RDD[(Long,
```
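Since the post is cut off here, a hedged sketch of the approaches it appears to enumerate (zipWithIndex and zipWithUniqueId for RDDs, monotonically_increasing_id for DataFrames); dataDF is a placeholder:
```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// RDD approach 1: zipWithIndex yields consecutive ids; swap to make the
// index the key. Results in RDD[(Long, String)].
val indexedRDD = dataRDD.zipWithIndex.map { case (line, idx) => (idx, line) }

// RDD approach 2: zipWithUniqueId yields unique (not consecutive) ids.
val uniqueRDD = dataRDD.zipWithUniqueId.map { case (line, id) => (id, line) }

// DataFrame approach: add a unique (not consecutive) Long id column.
val dfWithId = dataDF.withColumn("id", monotonically_increasing_id())
```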