Running the query below, I keep hitting a "local class incompatible"
exception. Does anyone know the cause?
val rdd = csc.cassandraSql("""select *, concat('Q', d_qoy) as qoy from
store_sales join date_dim on ss_sold_date_sk = d_date_sk join item on
ss_item_sk =
Can you please check the column order of all the datasets in your unionAll
operations? Are they all in the same order? unionAll matches columns by
position, not name; see the sketch after the quoted message below.
On 9 August 2016 at 02:47, max square wrote:
> Hey guys,
>
> I'm trying to save Dataframe in CSV format after performing unionAll
> operations on it.
> But I get this
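Since unionAll resolves columns by position, every DataFrame in the chain has
to list its columns in the same order. A minimal sketch of forcing a common
order before the union (df1 and df2 are placeholders):

import org.apache.spark.sql.functions.col

// Project df2 onto df1's column order before unioning by position.
val columns = df1.columns
val aligned = df2.select(columns.map(col): _*)
val unioned = df1.unionAll(aligned)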
So here is the test case from the commit that added the first/last methods:
https://github.com/apache/spark/pull/10957/commits/defcc02a8885e884d5140b11705b764a51753162
+ test("last/first with ignoreNulls") {
+val nullStr: String = null
+val df = Seq(
+ ("a", 0, nullStr),
Wondering why you are creating separate DStreams? You should apply the
logic directly on the input DStream; see the sketch after the quote below.
On 18 Aug 2016 06:40, "vidhan" wrote:
> I have a *kafka* stream coming in with some input topic.
> This is the code i wrote for accepting *kafka* stream.
>
> *>>> conf =
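In other words, something along these lines (a Scala sketch of the Kafka 0.8
direct-stream pattern; ssc, kafkaParams, the topic name, and the map are
placeholders, not from the thread):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kvs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("input-topic"))

// Apply the processing directly on the one input DStream
// instead of fanning out into separate streams.
val processed = kvs.map { case (_, value) => value.toUpperCase }
processed.print()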
Thanks, Harsh, for the reply.
When I change the code to something like this -
def saveAsLatest(df: DataFrame, fileSystem: FileSystem, bakDir: String) = {
  fileSystem.rename(new Path(bakDir + latest), new Path(bakDir + "/" +
    ScalaUtil.currentDateTimeString))
  fileSystem.create(new
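For what it's worth, a hedged sketch of the rename-then-recreate pattern with
a guard against the missing-path exception reported later in the thread
(ScalaUtil.currentDateTimeString is the helper from the snippet above; the
exists check and the output format are my assumptions):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

def saveAsLatest(df: DataFrame, fileSystem: FileSystem, bakDir: String): Unit = {
  val latestPath = new Path(bakDir + "/latest")
  // Only rename when the previous "latest" exists; renaming a missing
  // path is what surfaces as InvalidInputException downstream.
  if (fileSystem.exists(latestPath)) {
    fileSystem.rename(latestPath,
      new Path(bakDir + "/" + ScalaUtil.currentDateTimeString))
  }
  df.write.save(latestPath.toString)  // output format here is a placeholder
}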
I have a *kafka* stream coming in with some input topic.
This is the code I wrote for accepting the *kafka* stream.
*>>> conf = SparkConf().setAppName(appname)
>>> sc = SparkContext(conf=conf)
>>> ssc = StreamingContext(sc)
>>> kvs = KafkaUtils.createDirectStream(ssc, topics,\
I'm using Spark 1.6.2 for Vector-based UDAF and this works:
def inputSchema: StructType = new StructType().add("input", new VectorUDT())
Maybe it was made private in 2.0 (a fuller UDAF sketch follows this message).
On 17 August 2016 at 05:31, Alexey Svyatkovskiy
wrote:
> Hi Yanbo,
>
> Thanks for your reply. I will
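For reference, a minimal sketch of where that inputSchema sits in a 1.6-style
UDAF (the class name and the counting logic are illustrative, not from the
thread; this uses the public mllib VectorUDT):

import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Illustrative UDAF that just counts vectors; inputSchema is the point here.
class CountVectors extends UserDefinedAggregateFunction {
  def inputSchema: StructType = new StructType().add("input", new VectorUDT())
  def bufferSchema: StructType = new StructType().add("count", LongType)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getLong(0) + 1
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1(0) = b1.getLong(0) + b2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}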
The OneHotEncoder does *not* accept multiple columns.
You can use Michal's suggestion, where he uses a Pipeline to set the stages
and then executes them.
The other option is to write a function that performs one-hot encoding on a
column and returns a DataFrame with the encoded column, and then call
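A minimal sketch of that second option (df is your starting DataFrame, the
column names are hypothetical, and it assumes the input columns already hold
category indices, e.g. from StringIndexer):

import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.DataFrame

// One-hot encode a single indexed column, returning the widened DataFrame.
def encodeColumn(df: DataFrame, col: String): DataFrame =
  new OneHotEncoder()
    .setInputCol(col)
    .setOutputCol(col + "Encoded")
    .transform(df)

// Fold the helper over every column of interest.
val encoded = Seq("colA", "colB", "colC").foldLeft(df)(encodeColumn)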
Hi Ted/all,
I did the below to get the full stack trace; see below, I am not able to
understand the root cause..
except Exception as error:
    traceback.print_exc()
and this is what I get...
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/context.py",
line 580, in sql
return
I had already tried this way:
scala> val featureCols = Array("category","newone")
featureCols: Array[String] = Array(category, newone)
scala> val indexer = new
StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1)
:29: error: type mismatch;
found : Array[String]
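setInputCol takes a single String, hence the type mismatch. A hedged way
around it is one StringIndexer per column, staged in a pipeline (the output
column names are made up):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

val featureCols = Array("category", "newone")
// One indexer per input column, since setInputCol accepts only one name.
val indexers: Array[PipelineStage] = featureCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "Index")
}
val indexed = new Pipeline().setStages(indexers).fit(df1).transform(df1)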
I don't think it does. From the documentation:
https://spark.apache.org/docs/2.0.0-preview/ml-features.html#onehotencoder,
I see that it still accepts one column at a time.
On Wed, Aug 17, 2016 at 10:18 AM, janardhan shetty
wrote:
> 2.0:
>
> One hot encoding currently
You can just map over your columns and create a pipeline:
val columns = Array("colA", "colB", "colC")
val transformers: Array[PipelineStage] = columns.map {
x => new OneHotEncoder().setInputCol(x).setOutputCol(x + "Encoded")
}
val pipeline = new Pipeline()
.setStages(transformers)
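Presumably followed by fitting and applying it in one go (df being your
DataFrame):

val encoded = pipeline.fit(df).transform(df)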
On 17
Hi,
I can see that the exception is caused by the following; can you check where
in your code you are using this path?
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
hdfs://testcluster:8020/experiments/vol/spark_chomp_data/bak/restaurants-bak/latest
On Wed, 17 Aug
/bump
It'd be great if someone could point me in the right direction.
On Mon, Aug 8, 2016 at 5:07 PM, max square wrote:
> Here's the complete stacktrace - https://gist.github.com/rohann/
> 649b0fcc9d5062ef792eddebf5a315c1
>
> For reference, here's the complete function
Hi, I am trying to submit a Spark job with Oozie, but I have one problem:
when I submit the same job, sometimes it succeeds and sometimes it
fails.
I got this error message when the job failed:
Spark Version : 1.5.0
Record:
01-Jan-16
Expected Result:
2016
I used the below code, which was shared in the user group.
from_unixtime(unix_timestamp($"Creation Date","dd-MMM-yy"),""))
Is this the right approach, or do we have any other approach?
NOTE:
I tried the *year()* function but it gives only
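A hedged sketch of the same idea, wrapping year() around the parsed timestamp
(df and the column name follow the snippet above):

import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp, year}

// Parse "01-Jan-16" with dd-MMM-yy, then pull out the four-digit year (2016).
val withYear = df.withColumn("year",
  year(from_unixtime(unix_timestamp(col("Creation Date"), "dd-MMM-yy"))))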
2.0:
One-hot encoding currently accepts a single input column. Is there a way to
include multiple columns?
Hi folks,
Just a friendly message that we have added Python support to the REST
Spark Job Server project. If you are a Python user looking for a
RESTful way to manage your Spark jobs, please come have a look at our
project!
https://github.com/spark-jobserver/spark-jobserver
-Evan
I experienced very slow execution times
http://stackoverflow.com/questions/38803546/spark-r-2-0-dapply-very-slow
and am wondering why...
On Wed, Aug 17, 2016 at 1:12 PM, Felix Cheung
wrote:
> This is supported in Spark 2.0.0 as dapply and gapply. Please see the API
>
My code is very simple; if I use other Hive tables, my code works fine. This
particular table (a virtual view) is huge and might have more metadata.
It has only two columns.
The virtual view name is: cluster_table
# col_name    data_type
ln            string
parti
Please include user@ in your reply.
Can you reveal the snippet of Hive SQL?
On Wed, Aug 17, 2016 at 9:04 AM, vr spark wrote:
> spark 1.6.1
> mesos
> job is running for like 10-15 minutes and giving this message and i killed
> it.
>
> In this job, i am creating data frame
Hi,
I am using Spark 2.0 and I am getting unexpected results using the last()
method. Has anyone else experienced this? I get the sense that last() is
working correctly within a given data partition but not across the entire RDD.
first() seems to work as expected, so I can work around this by
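One hedged workaround is to make the ordering explicit with a window spanning
the whole dataset (this assumes a column, here ts, that defines the intended
order; Long.MinValue/MaxValue are the unbounded frame markers in this API):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

// A frame covering the entire ordered dataset makes last() deterministic.
val w = Window.orderBy(col("ts")).rowsBetween(Long.MinValue, Long.MaxValue)
val result = df.select(last(col("value"), ignoreNulls = true).over(w).as("lastValue"))

Note that a window with orderBy and no partitionBy pulls everything into one
partition, which is acceptable for a final aggregate but not for large
intermediate steps.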
From the Spark documentation
(http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence),
yes, you can use persist on a DataFrame instead of cache. All cache is is
shorthand for the default persist storage level, "MEMORY_ONLY". If you want
to persist the DataFrame to disk you
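That is, something like this sketch:

import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  // spills partitions to disk when memory is tight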
W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an unknown
offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910492
W0816 23:17:01.984987 16360 sched.cpp:1195] Attempting to accept an unknown
offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910493
W0816 23:17:01.985124 16360
Can you provide more information ?
Were you running on YARN ?
Which version of Spark are you using ?
Was your job failing ?
Thanks
On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote:
>
> W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an
> unknown offer
Can you show the complete stack trace ?
Which version of Spark are you using ?
Thanks
On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote:
> Hi,
> I am getting error on below scenario. Please suggest.
>
> i have a virtual view in hive
>
> view name log_data
> it has 2
Hi,
I am getting an error in the below scenario. Please suggest.
I have a virtual view in Hive.
View name: log_data
It has 2 columns:
query_map map
parti_date int
Here is my snippet for the Spark data frame:
res=sqlcont.sql("select parti_date FROM
Hello, I'd like to report incorrect behavior of the Dataset API, but I don't
know how I can do that; my JIRA account doesn't allow me to add an issue.
I'm using Apache 2.0.0, but the problem has existed since at least version 1.4
(given the doc, since 1.3).
The problem is simple to reproduce, as is the workaround,
This is supported in Spark 2.0.0 as dapply and gapply. Please see the API doc:
https://spark.apache.org/docs/2.0.0/api/R/
Feedback welcome and appreciated!
From: Yogesh Vyas
Sent: Tuesday, August 16, 2016 11:39 PM
Hi,
First, I am not sure if I should inherit from InputDStream or
ReceiverInputDStream. For ReceiverInputDStream, why would I want to run a
receiver on each worker node?
If I want to inherit InputDStream, what should I do in the compute() method?
--
Thanks,
David S.
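For the InputDStream route, a minimal hedged sketch of what compute() has to
provide (the class name and the per-batch contents are made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

class MyInputDStream(streamingContext: StreamingContext)
    extends InputDStream[String](streamingContext) {
  override def start(): Unit = {}  // set up connections/resources here
  override def stop(): Unit = {}   // tear them down here

  // Called on the driver once per batch interval; return that batch's RDD.
  override def compute(validTime: Time): Option[RDD[String]] =
    Some(streamingContext.sparkContext.parallelize(Seq(s"batch at $validTime")))
}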
Hi,
for small to medium-sized clusters I think Spark Standalone mode is a good
choice.
We are contemplating moving to YARN as our cluster grows.
What are the pros and cons of each, please? Which one offers the best
Thanking you
Hello guys: I have a problem loading a recommendation model. I have 2 models;
one is good (able to get recommendation results) and the other is not working. I checked
these 2 models; both are MatrixFactorizationModel objects. But in the metadata,
one is a PipelineModel and the other is a
Thank you for your comments
> You should just Seq(...).toDS
I tried this; however, the result did not change.
>> val ds2 = ds1.map(e => e)
> Why are you e => e (since it's identity) and does nothing?
Yes, e => e does nothing. For the sake of a simple example, I used
the simplest
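For completeness, the toDS route looks like this (a sketch, assuming a
SparkSession named spark):

import spark.implicits._

val ds1 = Seq(("a", 1), ("b", 2)).toDS()
ds1.show()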
Hi,
Is there any way of using UDFs in SparkR?
Regards,
Yogesh
My motivation is to simplify the Java code generated by Tungsten's code
generator.
Here is a dump of generated code from the program.
https://gist.github.com/kiszk/402bd8bc45a14be29acb3674ebc4df24
If we can let Catalyst know that the result of map is never null, we
can eliminate the conditional
Hi Michael,
Thanks a lot for your help. See the explain output for csv and text below. Do
you see anything worth investigating?
scala> spark.read.csv("people.csv").cache.explain(extended = true)
== Parsed Logical Plan ==
Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
== Analyzed Logical Plan ==
_c0: string,