You can get it from the SparkSession for backwards compatibility:
val sqlContext = spark.sqlContext
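For example (a minimal sketch; the query is only an illustration):
sqlContext.sql("SELECT from_unixtime(unix_timestamp()) AS now").show()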
On Mon, Aug 8, 2016 at 9:11 AM, Mich Talebzadeh
wrote:
> Hi,
>
> In Spark 1.6.1 this worked
>
> scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(),
Hi,
Also, in the shell you have the sql function available without the object.
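For example, in spark-shell for Spark 2.0 (a sketch; the shell imports
spark.sql for you, and the query is only an illustration):
scala> sql("SELECT from_unixtime(unix_timestamp()) AS now").show()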
Jacek
On 8 Aug 2016 6:11 a.m., "Mich Talebzadeh"
wrote:
> Hi,
>
> In Spark 1.6.1 this worked
>
> scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/
> HH:mm:ss.ss')
What about the following:
val sqlContext = spark
?
On 8 Aug 2016 6:11 a.m., "Mich Talebzadeh"
wrote:
> Hi,
>
> In Spark 1.6.1 this worked
>
> scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/
> HH:mm:ss.ss') ").collect.foreach(println)
>
Hi,
I think it's cluster deploy mode.
spark-submit --deploy-mode cluster --master yarn myStreamingApp.jar
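If you also want YARN to retry the driver (running in the ApplicationMaster)
after a failure, a sketch with an illustrative retry value:
spark-submit --deploy-mode cluster --master yarn \
  --conf spark.yarn.maxAppAttempts=4 \
  myStreamingApp.jar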
Pozdrawiam,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On
Hi,
How do you access HBase? What's the version of Spark?
(I don't see spark packages in the stack trace)
Pozdrawiam,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On
Hi Sreekanth,
Assuming you are using Spark 1.x, I believe the code below should work:
sqlContext.read.format("com.databricks.spark.xml").option("rowTag",
"emp").load("/tmp/sample.xml")
.selectExpr("manager.id", "manager.name",
"explode(manager.subordinates.clerk) as clerk")
.selectExpr("id", "name",
Hi Mich,
File a JIRA issue, as it seems they overlooked that part. Spark
2.0 has less and less HiveQL and more and more native support.
(My take on this is that the days of Hive in Spark are numbered and
Hive is going to disappear soon.)
Pozdrawiam,
Jacek Laskowski
Hi,
I'm sure that cluster deploy mode would solve it very well. It would then
be up to the cluster manager to re-execute the driver, wouldn't it?
Pozdrawiam,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at
Hi,
You need to add the spark-* libs to the project (the Eclipse-specific
way). The best bet would be to use Maven as the main project management
tool and then import the project.
Have you seen
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
?
Hi David,
Thank you for the detailed reply. I understand what you said about the ideas
behind broadcast variables, but I am still a little bit confused. In your reply,
you said:
*It has sent largeValue across the network to each worker already, and gave
you a key to retrieve it.*
So my question is,
I tried using the *sparkxgboost* package in my build.sbt file, but it failed.
Spark 2.0
Scala 2.11.8
Error:
[warn]
http://dl.bintray.com/spark-packages/maven/rotationsymmetry/sparkxgboost/0.2.1-s_2.10/sparkxgboost-0.2.1-s_2.10-javadoc.jar
[warn]
Hi,
Is there a way to get access to the SparkConf from the mapWithState function? I am
looking to implement logic that uses a config property inside the mapWithState function.
Thanks,
Raj
Hi all,
Here is a simplified example to show my concern. This example contains 3
files with 3 objects, and depends on Spark 1.6.1.
//file globalObject.scala
import org.apache.spark.broadcast.Broadcast
object globalObject {
var br_value: Broadcast[Map[Int, Double]] = null
}
//file
Hi Folks,
I am trying to flatten a variety of XMLs using DataFrames. I'm using the spark-xml
package, which automatically infers my schema and creates a DataFrame.
I do not want to hard-code any column names in the DataFrame as I have a lot of
varieties of XML documents and each might be a lot more
You can use dot notation.
select myList.vList from t where myList.nm='IP'
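If myList is an array of structs (as in your schema), one sketch is to
explode the array first and then use dot notation on the struct (the path
is illustrative, and spark is assumed to be a 2.0 SparkSession):

import org.apache.spark.sql.functions.{col, explode}

val df = spark.read.json("/path/to/data.json")   // illustrative path
df.select(explode(col("myList")).as("m"))
  .where(col("m.nm") === "IP")
  .select(col("m.vList"))
  .show()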
On Fri, Aug 12, 2016 at 9:11 AM, vr spark wrote:
> Hi Experts,
> Please suggest
>
> On Thu, Aug 11, 2016 at 7:54 AM, vr spark wrote:
>
>>
>> I have data which is json in this
Hi folks,
I am using Spark Streaming, and I am not clear whether there is a smart way to
restart the app once it fails. Currently we just have a cron job that checks
if the job is running every 2 or 5 minutes and restarts the app when
necessary.
According to the Spark Streaming guide:
- *YARN* - Yarn
Is there a DataFrame version of XGBoost in spark-ml? Has anyone used the
sparkxgboost package?
I see a similar issue was resolved recently:
https://issues.apache.org/jira/browse/SPARK-15285
On Fri, Aug 12, 2016 at 3:33 PM, Aris wrote:
> Hello folks,
>
> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
> smaller data unit tests work on my
Hrrm, that's interesting. Did you try with the subscribe pattern, out of
curiosity?
I haven't tested repartitioning on the underlying new Kafka consumer, so
it's possible I misunderstood something.
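For reference, a sketch of the subscribe-pattern variant of the 0-10 direct
stream (the topic regex and the kafkaParams map are illustrative):

import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](
    Pattern.compile("mytopic.*"),   // illustrative topic pattern
    kafkaParams))                   // the usual consumer properties map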
On Aug 12, 2016 2:47 PM, "Srikanth" wrote:
> I did try a test with spark 2.0 +
Hello folks,
I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
smaller data unit tests work on my laptop, when I'm on a cluster, I get
cryptic error messages:
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
>
I did try a test with Spark 2.0 + spark-streaming-kafka-0-10-assembly.
The partition count was increased using "bin/kafka-topics.sh --alter" after the
Spark job was started.
I don't see messages from new partitions in the DStream.
KafkaUtils.createDirectStream[Array[Byte], Array[Byte]] (
> ssc,
Great.
I like your second solution. But how can I make sure that cvModel holds
the best model overall (as opposed to the last one that was tried out
by the grid search)?
In addition, do you have an idea how to collect the average error of
each grid search (here 1x1x1)?
On 12/08/2016
On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker wrote:
> val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c",
> "a"))).toDF("x", "y")
> val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))
>
This workaround executes with no exceptions:
val
Are you checkpointing?
Beyond that, why are you using createStream instead of createDirectStream?
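For reference, a minimal sketch of the direct stream for the 0.8 integration
(the broker and topic are illustrative; note that with checkpointing enabled,
restarts resume from the checkpointed offsets rather than auto.offset.reset):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",   // illustrative broker
  "auto.offset.reset"    -> "smallest")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))           // illustrative topic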
On Fri, Aug 12, 2016 at 12:32 PM, Diwakar Dhanuskodi
wrote:
> Okay .
> I could delete the consumer group in zookeeper and start again to re
> use same consumer
You will need to cast bestModel to include the MLWritable trait. The class
Model does not mix it in by default. For instance:
cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path")
Alternatively, you could save the CV model directly, which takes care of
this:
cvModel.save("/my/path")
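A slightly fuller sketch of the first option, with the required import
(the path is the one from above and is only illustrative):

import org.apache.spark.ml.util.MLWritable
cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path")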
On
Hi everyone,
I've started experimenting with my codebase to see how much work I will
need to port it from 1.6.1 to 2.0.0. While regression-testing some of my DataFrame
transforms, I've discovered I can no longer pair a countDistinct with a
collect_set in the same aggregation.
Consider:
val df =
Hello all,
I am completely new to Spark. I downloaded the Spark project from GitHub (
https://github.com/apache/spark) and wanted to run the examples. I
successfully ran the Maven command:
build/mvn -DskipTests clean package
But I am not able to build the spark-examples_2.11 project. There are
I'm building an LDA Pipeline, currently with 4 steps: Tokenizer,
StopWordsRemover, CountVectorizer, and LDA. I would like to add more steps,
for example stemming and lemmatization, and also 1-grams and 2-grams together
(which I believe is not supported by the default NGram class). Is there a way to
add
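For reference, a minimal sketch of the four explicit stages mentioned above
(column names and the topic count are illustrative):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, StopWordsRemover, Tokenizer}

val tokenizer  = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val remover    = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
val vectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features")
val lda        = new LDA().setK(10).setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, vectorizer, lda))
// val model = pipeline.fit(docs)   // docs: a DataFrame with a string column "text"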
Hi,
Assume that I have run the following pipeline and have obtained the best logistic
regression model. How can I then save that model for later use? The following
command throws an error:
cvModel.bestModel.save("/my/path")
Also, is it possible to get the error (a collection of) for each
Hi there,
I have lots of raw data in several Hive tables, and we built a workflow to
"join" those records together and restructure them into HBase. It was done
using plain MapReduce to generate HFiles, and then incrementally load the
HFiles into HBase to guarantee the best performance.
However, we
Hi, everyone.
According to the official guide, I copied hdfs-site.xml, core-site.xml and
hive-site.xml to $SPARK_HOME/conf, and wrote code as below:
```Java
SparkSession spark = SparkSession
.builder()
.appName("Test Hive for Spark")
Hi Gourav,
Thanks for your input. As mentioned previously, we've tried the
broadcast. We've failed to broadcast 1.5GB...perhaps some tuning can help. We
see CPU go up to 100%, and then workers die during the broadcast. I'm not sure
if it's a good idea to broadcast that much, as spark's broadcast
I'm using pyspark ML's logistic regression implementation to do some
classification on an AWS EMR Yarn cluster.
The cluster consists of 10 m3.xlarge nodes and is set up as follows:
spark.driver.memory 10g, spark.driver.cores 3, spark.executor.memory 10g,
spark.executor.cores 4.
I enabled
OK, I've merged this PR to master and branch-2.0.
On 8/11/16 8:27 AM, Cheng Lian wrote:
I haven't figured out exactly how it failed, but the leading
underscore in the partition directory name looks suspicious. Could you
please try this PR to see whether it fixes the issue:
The point is that if you have skewed data then one single reducer will
end up taking a very long time; you do not even need to try this, just
search Google: skewed data is a known problem in joins, even in Spark.
Therefore, instead of using a join, in case the use case permits, just write
a
Hi,
Have you looked at the web UI? You should find such task metrics there.
Jacek
On 11 Aug 2016 6:28 p.m., "Suman Somasundar"
wrote:
> Hi,
>
>
>
> While going through the logs of an application, I noticed that I could not
> find any logs to dig deeper into any of the
Hi Vinay,
Just out of curiosity, why are you converting your DataFrames into RDDs
before the join? Join works quite well with DataFrames.
As for your problem, it looks like you gave your executors more
memory than you physically have. As an example of executor
configuration:
> Cluster of 6
You could have a very large key? Perhaps a token value?
I love the RDD API but have found that for joins the DataFrame/Dataset API
performs better. Maybe you can do the joins with that?
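E.g., a minimal sketch of joining the DataFrames directly (the join key "id"
is illustrative):

val joined = dfA.join(dfB, Seq("id"))
joined.explain()   // check whether a broadcast or sort-merge join was planned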
On Thu, Aug 11, 2016 at 7:41 PM, Muttineni, Vinay
wrote:
> Hello,
>
> I have a spark job that
I generally like the type-inference feature of the Spark SQL CSV
datasource; however, I have been stung several times by date inference. The
problem is that when a column is converted to a date type, the original data
is lost. This is not a lossless conversion, and I often have a requirement
where I
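For what it's worth, a minimal sketch of keeping the original text (assuming
the built-in Spark 2.0 CSV reader; path and options are illustrative) is to
disable inference for that read:

val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "false")   // all columns stay StringType, so nothing is lost
  .csv("/path/to/data.csv")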
Hi Experts,
Please suggest
On Thu, Aug 11, 2016 at 7:54 AM, vr spark wrote:
>
> I have data which is json in this format
>
> myList: array
> |    |-- elem: struct
> |    |    |-- nm: string (nullable = true)
> |    |    |-- vList: array (nullable =
Hi,
We are using Spark 1.6.1 and Kafka 0.9.
KafkaUtils.createStream is showing strange behaviour: although
auto.offset.reset is set to smallest, whenever we need to restart the
stream it picks up the latest offset, which is not expected.
Do we need to set any
Hi Ashic,
That is a pretty 2011 way of solving the problem. What is more painful
about this way of working is that you need to load the data into Redis,
keep a Redis cluster running, and in case you are working across several
clusters then maybe install Redis in all of them or hammer your