Hi,
I have a dataframe that contains an embedded JSON string in one of the fields.
I tried to write a UDF that parses it using lift-json, but it seems to take a
very long time to process, and it seems that only the master node is working.
Has anyone dealt with such a scenario before?
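For what it's worth, a minimal sketch of doing the extraction with Spark's built-in JSON path function instead of a lift-json UDF (column names and JSON paths below are placeholders, and this assumes Spark 1.6+ where get_json_object is available). If only the master appears busy, it is also worth checking that the parsing really happens inside a select/UDF on the executors and not after a collect() on the driver:

  import org.apache.spark.sql.functions.{col, get_json_object}

  // df is the dataframe holding the embedded JSON string; "payload" is a placeholder
  // column name. get_json_object is evaluated as a column expression on the
  // executors, so no driver-side parsing is involved.
  val parsed = df.select(
    col("id"),
    get_json_object(col("payload"), "$.user.name").alias("user_name"),
    get_json_object(col("payload"), "$.event.type").alias("event_type")
  )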
Hi Mich,
Can you try placing these jars in the Spark classpath?
It should work.
Thanks,
Divya
On 22 December 2016 at 05:40, Mich Talebzadeh
wrote:
> This works with Spark 2 with Oracle jar file added to
>
> $SPARK_HOME/conf/spark-defaults.conf
Hi All,
Version:
Spark 1.5.1
Hadoop 2.7.2
Is there any way to submit and monitor Spark tasks on YARN via Java
asynchronously?
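One option that is often suggested is SparkLauncher. A hedged sketch follows, not tested against 1.5.1: SparkLauncher exists there, but startApplication()/SparkAppHandle only arrived in Spark 1.6; on 1.5.x, launch() returning a plain java.lang.Process is the only choice. All paths, class names and settings below are placeholders.

  import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

  // Submits asynchronously and reports state changes through a listener.
  val handle: SparkAppHandle = new SparkLauncher()
    .setAppResource("/path/to/my-app.jar")        // placeholder jar
    .setMainClass("com.example.MyApp")            // placeholder main class
    .setMaster("yarn")
    .setDeployMode("cluster")
    .startApplication(new SparkAppHandle.Listener {
      override def stateChanged(h: SparkAppHandle): Unit =
        println(s"state=${h.getState} appId=${h.getAppId}")
      override def infoChanged(h: SparkAppHandle): Unit = ()
    })
  // startApplication returns immediately; poll handle.getState or call
  // handle.stop() / handle.kill() later as needed.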
In Spark 1.6.2, it was possible to access the HiveConf object via the below
method.
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/hive/HiveContext.html#hiveconf()
Can anyone let me know how to do the same in Spark 2.0.2, from the
SparkSession object?
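As far as I know there is no public hiveconf() equivalent on SparkSession. A hedged workaround sketch (the property keys are only examples, and values will be absent if hive-site.xml is not on the classpath): read Hive/Hadoop settings from the Hadoop configuration carried by the SparkContext, or Spark SQL settings from the session's runtime conf.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

  // Hadoop/Hive properties picked up from core-site.xml / hive-site.xml
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  val metastoreUris = hadoopConf.get("hive.metastore.uris")    // example key

  // Spark SQL runtime configuration
  val warehouseDir = spark.conf.get("spark.sql.warehouse.dir")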
I am debugging problems with a PySpark RandomForestClassificationModel, and
I am trying to use the feature importances to do so. However, the
featureImportances property returns a SparseVector that isn't straightforward
to interpret. How can I transform the SparseVector into a useful list of
features?
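One approach that is commonly suggested (a hedged sketch, written in Scala; the PySpark version is the same idea of zipping with the assembler's inputCols): pair the importances vector with the feature column names that went into the VectorAssembler. rfModel and featureCols below are placeholders.

  import org.apache.spark.ml.classification.RandomForestClassificationModel

  // rfModel: the fitted model; featureCols: the inputCols given to the VectorAssembler
  def namedImportances(rfModel: RandomForestClassificationModel,
                       featureCols: Array[String]): Seq[(String, Double)] =
    featureCols
      .zip(rfModel.featureImportances.toArray)   // densify the SparseVector
      .sortBy { case (_, imp) => -imp }          // most important feature first
      .toSeq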
Thanks Ayan, do you mean
"driver" -> "oracle.jdbc.OracleDriver"
We added that one but it did not work!
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Try providing the correct driver name through the property variable in the JDBC
call.
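Something along these lines, for example (a hedged sketch for Spark 2; URL, table and credentials are placeholders):

  val df = spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/MYSERVICE")  // placeholder URL
    .option("dbtable", "SCOTT.EMP")                              // placeholder table
    .option("user", "scott")
    .option("password", "********")
    .option("driver", "oracle.jdbc.OracleDriver")                // explicit driver class
    .load()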
On Thu., 22 Dec. 2016 at 8:40 am, Mich Talebzadeh
wrote:
> This works with Spark 2 with Oracle jar file added to
>
> $SPARK_HOME/conf/spark-defaults.conf
This works with Spark 2 with Oracle jar file added to
$SPARK_HOME/conf/spark-defaults.conf
spark.driver.extraClassPath /home/hduser/jars/ojdbc6.jar
spark.executor.extraClassPath /home/hduser/jars/ojdbc6.jar
and you get
scala> val s = HiveContext.read.format("jdbc").options(
Do you know who I can talk to about this code? I am really curious to know why
there is a join and why the number of partitions for the join is the sum of both
of them; I expected the number of partitions to be the same as the streamed
table, or in the worst case multiplied.
Sent from my iPhone
Summary: spark-shell fails to redefine values in some cases. This is at
least found in a case where "implicit" is involved, but not limited to such
cases.
Run the following in spark-shell; you can see that the last redefinition does
not take effect. The same code runs in the plain Scala REPL without
Hi,
I am running linear regression on a dataframe and get the following error:
Exception in thread "main" java.lang.AssertionError: assertion failed:
Training dataset is empty.
at scala.Predef$.assert(Predef.scala:170)
at
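That assertion means the DataFrame handed to fit() had zero rows, typically because of an unlucky randomSplit, an over-aggressive filter, or all rows being dropped when nulls are removed. A hedged sanity-check sketch (df and the column names are placeholders):

  import org.apache.spark.ml.regression.LinearRegression

  val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
  val trainCount = train.count()
  println(s"training rows: $trainCount")       // must be > 0 before calling fit()

  if (trainCount > 0) {
    val lr = new LinearRegression()
      .setFeaturesCol("features")              // placeholder column names
      .setLabelCol("label")
    val model = lr.fit(train)
  }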
I tried caching the parent data set but it slows down the execution time. The
last column in the input data set is a double array, and the requirement is to
add the last-column double arrays after doing a group by. I have implemented an
aggregation function which adds the last column. Hence the query is
Select
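For reference, a hedged sketch (assuming Spark 2.x, with placeholder names rather than the original code) of one way to express the element-wise addition of the last-column double arrays with a typed aggregation instead of a custom UDAF:

  import org.apache.spark.sql.SparkSession

  case class Record(key: String, values: Array[Double])    // placeholder schema

  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  val ds = spark.read.parquet("/path/to/input").as[Record] // placeholder path

  val summed = ds.groupByKey(_.key).mapGroups { (k, rows) =>
    val total = rows.map(_.values).reduce { (a, b) =>
      a.zip(b).map { case (x, y) => x + y }                // element-wise add
    }
    (k, total)
  }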
Hi All,
I have a requirement where I have to run 100 group-by queries with different
columns. I have generated the parquet file, which has 30 columns; I see that
every parquet file has a different size and 200 files are generated. My question
is: what is the best approach to run group-by queries on
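One common pattern (a hedged sketch; the path, partition count and column list are placeholders): read the parquet data once, cache it, and run all the group-bys against the cached DataFrame so the 200 files are scanned only once.

  val df = spark.read.parquet("/path/to/parquet")    // placeholder path
    .repartition(64)                                 // even out the uneven file sizes
    .cache()
  df.count()                                         // materialise the cache

  val groupByCols = Seq("col1", "col2", "col3")      // placeholder list of columns
  val results = groupByCols.map { c =>
    c -> df.groupBy(c).count()                       // or .agg(...) as required
  }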
Incremental load traditionally means generating hfiles and
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
data into hbase.
For your use case, the producer needs to find rows where the flag is 0 or 1.
After such rows are obtained, it is up to you how the result of
Check the source code for SparkLauncher:
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java#L541
a separate process will be started using `spark-submit` and if it uses
`yarn-cluster` mode, a driver may be launched on another NodeManager
I already set
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
to enable kryo and
.set("spark.kryo.registrationRequired", "true")
to force kryo. Strangely, I see the issue of this missing Dataset[]
Trying to register regular classes like Date
You better ask folks in the spark-jobserver gitter channel:
https://github.com/spark-jobserver/spark-jobserver
On Wed, Dec 21, 2016 at 8:02 AM, Reza zade wrote:
> Hello
>
> I've extended the JavaSparkJob (job-server-0.6.2) and created an object
> of SQLContext class. my
to enable kryo serializer you just need to pass
`spark.serializer=org.apache.spark.serializer.KryoSerializer`
the `spark.kryo.registrationRequired` setting controls the following behavior:
Whether to require registration with Kryo. If set to 'true', Kryo will
throw an exception if an unregistered
Thank you Nick that is good to know.
Would this have some opportunity for newbs (like me) to volunteer some time?
Sent from my iPhone
> On Dec 21, 2016, at 9:08 AM, Nick Pentreath wrote:
>
> It is part of the general feature parity roadmap. I can't recall offhand any
Ok, sure, will ask.
But what would be a generic best-practice solution for incremental load from
HBase?
On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu wrote:
> I haven't used Gobblin.
> You can consider asking Gobblin mailing list of the first option.
>
> The second option would
You can track https://issues.apache.org/jira/browse/SPARK-15784 for the
progress.
On Wed, Dec 21, 2016 at 7:08 AM, Nick Pentreath
wrote:
> It is part of the general feature parity roadmap. I can't recall offhand
> any blocker reasons it's just resources
> On Wed, 21
I am having trouble with streaming performance. My main problem is how to do a
sliding window calculation where the ratio between the window size and the step
size is relatively large (hundreds) without recalculating everything all the
time.
I created a simple example of what I am aiming at with
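For reference, the incremental form that is usually suggested when the window-to-slide ratio is large is reduceByKeyAndWindow with an inverse reduce function, so each slide only adds the newest batch and subtracts the batch that fell out of the window instead of recomputing the whole window. A hedged sketch (the source, durations and checkpoint directory are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("sliding-window-sketch")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/tmp/checkpoint")                    // required for the inverse form

  val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
  val pairs = lines.map(word => (word, 1L))

  val counts = pairs.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,                       // data entering the window
    (a: Long, b: Long) => a - b,                       // data leaving the window
    Seconds(300),                                      // window length
    Seconds(1)                                         // slide interval
  )
  counts.print()
  ssc.start()
  ssc.awaitTermination()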
Spark isn't a storage system -- it's a batch processing system at heart. To
"serve" something means to run a distributed computation scanning
partitions for an element and collect it to a driver and return it.
Although that could be fast enough for some definition of fast, it's going
to be orders
Hello,
I had a discussion today with a colleague who was saying the following:
"We can use Spark as fast serving layer in our architecture, that is we can
compute an RDD or even a dataset using Spark SQL,
then we can cache it and offer the front-end layer access to our
application in
I haven't used Gobblin.
You can consider asking the Gobblin mailing list about the first option.
The second option would work.
On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri
wrote:
> Hello Guys,
>
> I would like to understand different approach for Distributed
It is part of the general feature parity roadmap. I can't recall offhand
any blocker reasons; it's just resources.
On Wed, 21 Dec 2016 at 17:05, Robert Hamilton
wrote:
> Hi all. Is it on the roadmap to have an
> Spark.ml.clustering.PowerIterationClustering? Are there
Hi all. Is it on the roadmap to have a
Spark.ml.clustering.PowerIterationClustering? Are there technical reasons that
there is currently only an .mllib version?
Sent from my iPhone
To force spark to use kryo serialization I set
spark.kryo.registrationRequired to true.
Now Spark complains: Class is not registered:
org.apache.spark.sql.types.DataType[].
How can I fix this? So far I could not successfully register this class.
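The entry in the error is an array class, which is easiest to register programmatically. A hedged sketch (the DataType[] entry is the one from the error; the other classes are only examples, and with registrationRequired=true a Dataset-heavy job may keep surfacing further internal classes to register):

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.types.{DataType, StructType}

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")
    .registerKryoClasses(Array(
      classOf[Array[DataType]],      // the array class named in the error
      classOf[StructType],           // example related type
      classOf[java.util.Date]        // example regular class
    ))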
Hi,
When I am trying to write a dataset to parquet or to show(1, false), my job
throws an ArrayIndexOutOfBoundsException.
16/12/21 12:38:50 WARN TaskSetManager: Lost task 7.0 in stage 36.0 (TID 81,
ip-10-95-36-69.dev): java.lang.ArrayIndexOutOfBoundsException: 63
at
Thanks Liang!
I get your point. It would mean that when launching Spark jobs, the mode needs
to be specified as client for all Spark jobs.
However, my concern is whether the driver's memory (of the process which is
launching the Spark jobs) will be used up completely by the Futures
(SparkContexts) or these spawned
Hi Sebastian,
Yes, for fetching the details from Hive and HBase, I would want to use
Spark's HiveContext etc.
However, based on your point, I might have to check if JDBC based driver
connection could be used to do the same.
Main reason for this is to avoid a client-server architecture design.
Hello
I've extended the JavaSparkJob (job-server-0.6.2) and created an object
of the SQLContext class. My Maven project doesn't have any problem during the
compile and packaging phases, but when I send the project's .jar to SJS and run
it, a "NoClassDefFoundError" is raised. The trace of the exception is:
I have two dataframes which I am joining: a small one and a big one. The
optimizer suggests using BroadcastNestedLoopJoin.
The number of partitions for the big dataframe is 200, while the small dataframe
has 5 partitions.
The joined dataframe results in 205 partitions.
Is there any reason you need a context on the application launching the
jobs?
You can use SparkLauncher in a normal app and just listen for state
transitions
On Wed, 21 Dec 2016, 11:44 Naveen, wrote:
> Hi Team,
>
> Thanks for your responses.
> Let me give more details in
Hi Team,
Thanks for your responses.
Let me give more details in a picture of how I am trying to launch jobs.
The main Spark job will launch other Spark jobs, similar to calling spark-submit
multiple times within a Spark driver program.
These spawned threads for new jobs will be totally different components,
@Sean perhaps I could leverage this when this
http://openjdk.java.net/jeps/261 becomes available.
On Fri, Dec 16, 2016 at 4:05 AM, Steve Loughran
wrote:
> FWIW, although the underlying Hadoop declared guava dependency is pretty
> low, everything in org.apache.hadoop is
Hello Guys,
I would like to understand different approaches for distributed incremental
load from HBase. Is there any *tool / incubator tool* which satisfies the
requirement?
*Approach 1:*
Write a Kafka producer, manually maintain a column flag for events, and
ingest it with LinkedIn Gobblin to HDFS /
I am not familiar with any problem with that.
Anyway, if you run a Spark application you would have multiple jobs, so it makes
sense that it is not a problem.
Thanks David.
From: Naveen [mailto:hadoopst...@gmail.com]
Sent: Wednesday, December 21, 2016 9:18 AM
To: d...@spark.apache.org;