I would import the data via Sqoop and put it on HDFS. Sqoop has some mechanisms
to handle the lack of reliability of JDBC.
Then you can process the data via Spark. You could also use the JDBC RDD, but I
do not recommend it, because you do not want to pull data out of the database
all the time when
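For the Sqoop-then-Spark path, a rough sketch of reading the landed files (the path and delimiter are illustrative, assuming Sqoop's default delimited-text output):

  // Hypothetical HDFS path where Sqoop wrote the table as delimited text
  val raw = sc.textFile("hdfs:///user/etl/sqoop/mytable")
  val rows = raw.map(_.split(","))  // Sqoop's default is comma-delimited text
  rows.take(5).foreach(r => println(r.mkString("|")))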
I may be wrong here, but beeline is basically a client library. So you
"connect" to STS and/or HS2 using beeline.
Spark connecting over JDBC is a different discussion and in no way related to
beeline. When you read data from a DB (Oracle, DB2, etc.) you do not use
beeline, but a JDBC connection to
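For what it's worth, a minimal sketch of such a JDBC read in Spark (URL, table, and credentials are all made up):

  // Illustrative connection details only
  val df = sqlContext.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@dbhost:1521:ORCL")
    .option("dbtable", "SCOTT.EMP")
    .option("user", "scott")
    .option("password", "tiger")
    .load()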
I have an RDD[(String, MyObj)] which is the result of a Join + Map operation. It
has no partitioner info. I run reduceByKey without passing any Partitioner
or partition counts. I observed that the aggregated output for a given
key is sometimes incorrect, roughly 1 out of 5 times. It looks like reduce
Hi,
I am a PySpark user and I want to extract variable importance from a random
forest model for plotting.
How can I do that?
Thanks
Hi,
I'm trying to use the BinaryClassificationMetrics class to compute the PR
curve as below -
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
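For reference, the usual shape of this computation is roughly as follows, where scoreAndLabels is an assumed RDD[(Double, Double)] of (score, label) pairs:

  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  val pr = metrics.pr()            // RDD of (recall, precision) points
  pr.collect().foreach(println)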
Sorry, I think you misunderstood.
Spark can read from JDBC sources, so saying that using beeline as a way to
access data is not a Spark application isn't really true. Would you say the
same if you were pulling data into Spark from Oracle or DB2?
There are a couple of different design patterns and
Please check the driver and executor log, there should be logs about where
the data is written.
On Wed, Jun 22, 2016 at 2:03 AM, Pierre Villard wrote:
> Hi,
>
> I have a Spark job writing files to HDFS using .saveAsHadoopFile method.
>
> If I run my job in
Hi Ted,
I didn't type any command; it just threw that exception right after launch.
Thanks!
On Mon, Jun 20, 2016 at 7:18 PM, Ted Yu wrote:
> What operations did you run in the Spark shell ?
>
> It would be easier for other people to reproduce using your code snippet.
>
>
1. Yes, in the sense that you control the number of executors from the Spark
application config.
2. Any IO will be done from executors (never on the driver, unless you
explicitly call collect()). For example, a connection to a DB happens once for
each worker (and is used by the local executors). Also, if you run a
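A sketch of that per-partition connection pattern (jdbcUrl and the write step are illustrative):

  rdd.foreachPartition { iter =>
    // Opened on the executor, once per partition, never on the driver
    val conn = java.sql.DriverManager.getConnection(jdbcUrl)
    iter.foreach { row =>
      // write row using conn ...
    }
    conn.close()
  }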
Hi,
I have written a program using SparkIMain which creates an RDD and I am
looking for a way to access that RDD in my normal Spark/Scala code for
further processing.
The code below binds the SparkContext:
sparkIMain.bind("sc", "org.apache.spark.SparkContext", sparkContext,
By repartition I think you mean coalesce() where you would get one parquet file
per partition?
And this would be a new immutable copy so that you would want to write this new
RDD to a different HDFS directory?
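I.e., something like this sketch (assuming a DataFrame and an illustrative target path):

  // Writes a compacted, immutable copy; the source directory is untouched
  df.coalesce(1).write.parquet("hdfs:///data/sales_compacted")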
-Mike
> On Jun 21, 2016, at 8:06 AM, Eugene Morozov
Ok, it's the end of the day and I'm trying to make sure I understand the
locality of where things are running.
I have an application where I have to query a bunch of sources, creating some
RDDs and then I need to join off the RDDs and some other lookup tables.
Yarn has two modes… client and
Hi List
I am trying to install the Spark-Cassandra connector through Maven or sbt but
neither works.
Both try to connect to the Internet (to which I have no connection) to
download certain files.
Is there a way to install the files manually?
I downloaded from the maven repository -->
Thanks @Cody, I will try that out. In the interim, I tried to validate
my HBase cluster by running a random write test and saw 30-40K writes
per second. This suggests there is noticeable room for improvement.
On Tue, Jun 21, 2016 at 8:32 PM, Cody Koeninger wrote:
> Take HBase
Take HBase out of the equation and just measure what your read
performance is by doing something like
createDirectStream(...).foreachRDD(_.foreach(println))
not take() or print()
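Spelled out as a rough sketch against the direct stream API (kafkaParams and topics are assumed to be what your app already uses; rdd.count() is used here simply as a way to force a full read of every record):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
  // Forces every record to be read, without collecting data to the driver
  stream.foreachRDD(rdd => println(s"read ${rdd.count()} records"))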
On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams wrote:
> @Cody I was able to bring my processing time
@Cody I was able to bring my processing time down to a second by
setting maxRatePerPartition as discussed. My bad that I didn't
recognize it as the cause of my scheduling delay.
Since then I've tried experimenting with a larger Spark Context
duration. I've been trying to get some noticeable
Use `withColumn`. It will replace a column if you give it the same name.
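For example (column name and type are illustrative):

  // Re-using the name "price" replaces the existing column with the cast version
  val df2 = df.withColumn("price", df("price").cast("double"))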
On Tue, Jun 21, 2016 at 4:16 AM, pseudo oduesp
wrote:
> Hi,
> with fillna we can select some columns and replace some values,
> choosing the columns with a dict
> {column: value}
> but
I'm using the Spark Thrift server to execute SQL queries over JDBC.
I'm wondering if it's possible to plug in a class to do some
pre-processing on the SQL statement before it gets passed to the
SQLContext for actual execution? I scanned the code and it
doesn't look like this is supported, but I
Hi,
I have a Spark job writing files to HDFS using the .saveAsHadoopFile method.
If I run my job in local/client mode, it works as expected and I get all my
files written to HDFS. However, if I change to yarn/cluster mode, I don't
see any error logs (the job is successful) but there are no files
To answer more accurately to your question, the model.fit(df) method takes
in a DataFrame of Row(label=double, feature=Vectors.dense([...])).
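In Scala terms, a toy sketch of that shape (values are made up; the PySpark equivalent is analogous):

  import org.apache.spark.mllib.linalg.Vectors
  // Column names must be "label" and "features"
  val train = sqlContext.createDataFrame(Seq(
    (0.0, Vectors.dense(1.0, 0.5)),
    (1.0, Vectors.dense(0.3, 2.1))
  )).toDF("label", "features")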
cheers,
Ardo.
On Tue, Jun 21, 2016 at 6:44 PM, Ndjido Ardo BAR wrote:
> Hi,
>
> You can use a RDD of LabelPoints to fit your model.
Hi,
You can use an RDD of LabeledPoints to fit your model. Check the docs for
more examples:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=transform#pyspark.ml.classification.RandomForestClassificationModel.transform
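A minimal sketch of building such an RDD (the input shape is an assumption):

  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.linalg.Vectors
  // Assumes rdd: RDD[(Double, Array[Double])] of (label, raw feature values)
  val training = rdd.map { case (label, values) => LabeledPoint(label, Vectors.dense(values)) }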
cheers,
Ardo.
On Tue, Jun 21, 2016 at 6:12 PM, pseudo
Hi,
I am a PySpark user and I want to test random forest.
I have a dataframe with 100 columns.
Should I give an RDD or a DataFrame to the algorithm? I transformed my
dataframe to only two columns:
label and features columns
df.label df.features
0 (517,(0,1,2,333,56 ...
1
Hi,
I wonder if it is possible to checkpoint only the metadata and not the data in
RDDs and DataFrames.
Thanks,
Natu
Apurva,
I'd say you have to apply repartition just once, to the RDD that is the union
of all your files.
And it has to be done right before you do anything else.
If something in your files is not needed, then the sooner you project, the
better.
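Roughly (a sketch; projectNeededFields and the partition count are illustrative):

  // Union everything first, project early, then repartition once
  val all = sc.union(fileRDDs).map(projectNeededFields)
  val balanced = all.repartition(200)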
Hope this helps.
--
Be well!
Jean Morozov
On Tue,
If you're using the direct stream, and don't have speculative
execution turned on, there is one executor consumer created per
partition, plus a driver consumer for getting the latest offsets. If
you have fewer executors than partitions, not all of those consumers
will be running at the same time.
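One way to see that mapping from inside your job (a sketch; stream is the direct stream you already create):

  stream.foreachRDD { rdd =>
    // With the direct stream, RDD partitions map 1:1 to Kafka partitions
    println(s"Kafka partitions consumed this batch: ${rdd.partitions.length}")
  }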
I use Spark Streaming with Kafka and I'd like to know how many consumers
are created. I guess as many as there are partitions in Kafka, but I'm not
sure.
Is there a way to know the name of the groupId generated by Spark for Kafka?
Hi,
Please help me resolve this issue.
Just clamp the predicted value to 0?
But you may also want to reconsider your model in this case; maybe a
simple linear model is not appropriate.
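For instance, with an MLlib regression model (names are illustrative):

  // Clamp each prediction at zero after the fact
  val preds = model.predict(features).map(p => math.max(0.0, p))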
On Tue, Jun 21, 2016 at 2:03 PM, diplomatic Guru
wrote:
> Hello Sean,
>
> Absolutely, there is nothing wrong with predicting
Hello,
I am trying to combine several small text files (each file is approx.
hundreds of MBs to 2-3 GB) into one big parquet file.
I am loading each one of them and trying to take a union; however, this
leads to an enormous number of partitions, as union keeps adding the
partitions of the input
Hi,
Can somebody please explain the way FullOuterJoin works in Spark? Does each
intersection get fully loaded into memory?
My problem is as follows:
I have two large data-sets:
* a list of web pages,
* a list of domain-names with specific rules for processing pages from that
domain.
I am
Hello Sean,
Absolutely, there is nothing wrong with predicting negative values, but for
my scenario I do not want to predict any negative values (also, all the data
that is fed into the model is positive). Is there any way I could stop
predicting negative values? I assume it is not possible, but
Hi,
It looks like this is not related to Alluxio. Have you tried running the
same job with different storage?
Maybe you could increase the Spark JVM heap size to see if that helps your
issue?
Hope that helps,
Gene
On Wed, Jun 15, 2016 at 8:52 PM, Chanh Le wrote:
> Hi
There's nothing inherently wrong with a regression predicting a
negative value. What is the issue, more specifically?
On Tue, Jun 21, 2016 at 1:38 PM, diplomatic Guru
wrote:
> Hello all,
>
> I have a job for forecasting using linear regression, but sometimes I'm
>
Hello all,
I have a job for forecasting using linear regression, but sometimes I'm
getting a negative prediction. How do I prevent this?
Thanks.
Hi,
With fillna we can select some columns and replace some values, choosing the
columns with a dict
{column: value}
but how can I do the same with cast? I have a data frame with 300 columns and I
want to cast just 4 columns from a list, but with a select query like this:
Are you using 1.6.1 ?
If not, does the problem persist when you use 1.6.1 ?
Thanks
> On Jun 20, 2016, at 11:16 PM, umanga wrote:
>
> I am getting following warning while running stateful computation. The state
> consists of BloomFilter (stream-lib) as Value and Integer
Further description:
Environment: Spark cluster running in standalone mode with 1 master and 5
slaves, each with 4 vCPUs and 8 GB RAM.
Data is being streamed from a 3-node Kafka cluster (managed by a 3-node ZK
cluster).
Checkpointing is done on the Hadoop cluster,
plus we are also saving state in HBase.
Looks like this issue https://issues.apache.org/jira/browse/SPARK-10309
On Mon, Jun 20, 2016 at 4:27 PM, pseudo oduesp
wrote:
> Hi,
> I have no idea why I get this error
>
>
>
> Py4JJavaError: An error occurred while calling o69143.parquet.
> :
thanks Robin.
This is data from Hive (source) to Hive (target) via Spark. The database in
Hive is called oraclehadoop (mainly used to import data from Oracle in the
first place).
I am very sceptical of these methods in Spark for storing data in a
Hive database. In all probability they just
You need to send an email to user-unsubscr...@spark.apache.org for
unsubscribing. Read more over here http://spark.apache.org/community.html
On Mon, Jun 20, 2016 at 1:10 PM, Ram Krishna
wrote:
> Hi Sir,
>
> Please unsubscribe me
>
> --
> Regards,
> Ram Krishna KT
>
>
If you are able to trace the underlying Oracle session, you can see whether a
commit has been called or not.
> On 21 Jun 2016, at 09:57, Robin East wrote:
>
> I’m not sure - I don’t know what those APIs do under the hood. It simply rang
> a bell with something I have
I’m not sure - I don’t know what those APIs do under the hood. It simply rang a
bell with something I have fallen foul of in the past (not with Spark though) -
I have wasted many hours forgetting to commit and then scratching my head as to
why my data is not persisting.
> On 21 Jun 2016, at
Hi,
Can someone please look into this and tell me what's wrong and why I am not
getting any output?
Thanks & Regards
Biplob Biswas
On Sun, Jun 19, 2016 at 1:29 PM, Biplob Biswas
wrote:
> Hi,
>
> Thanks for that input, I tried doing that but apparently that's not
That is a very interesting point. I am not sure how I can do that with
sorted.save("oraclehadoop.sales2")
Like with a commit?
thanks
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
random thought - do you need an explicit commit with the 2nd method?
> On 20 Jun 2016, at 21:35, Mich Talebzadeh wrote:
>
> Hi,
>
> I have a DF based on a table and sorted and shown below
>
> This is fine and when I register as tempTable I can populate the
Hi,
Really, I am getting angry about parquet files. Each time I get an error like
Could not read footer: java.lang.RuntimeException:
or an error occurring in o127.load.
Why do we have so many issues with this format?
Thanks
Hello guys,
I would like to know if there is a way, in standalone mode, to run Spark
executor processes using the user/group of the user that submitted the job.
I found an old open wish that exactly describes my need, but I want to
be sure that it's still not possible before thinking
I am getting the following warning while running a stateful computation. The
state consists of a BloomFilter (stream-lib) as the value and an Integer as the
key.
The program runs smoothly for a few minutes; after that I get this
warning and the streaming app becomes unstable (processing time increases