Is there a way to create multiple output files when connected from beeline to
the Thrift server?
Right now I am using beeline -e 'query' > output.txt, which is not efficient
because it relies on a Linux shell operator to combine the output files.
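If running the query as a Spark job (rather than through beeline) is an option,
the result can be written out as multiple part files directly; a rough sketch,
with the query, path and partition count purely illustrative:
val result = sqlContext.sql("SELECT * FROM some_table WHERE dt = '2016-04-28'")
result.repartition(10).write.json("/tmp/query-output")   // produces 10 part files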
For #1, have you seen this JIRA?
[SPARK-14867][BUILD] Remove `--force` option in `build/mvn`
On Thu, Apr 28, 2016 at 8:27 PM, Demon King wrote:
> BUG 1:
> I have installed maven 3.0.2 in system, When I using make-distribution.sh
> , it seem not use maven 3.2.2 but use
Hello Mike,
No problem, logs are useful to us anyway. Thank you for all the pointers.
We started off examining only a single RDD but later added a few more. The
persist/count and unpersist/count sequence is the dummy stage you suggested we
use to avoid the initial scheduler delay.
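A minimal sketch of that warm-up sequence, assuming an existing SparkContext sc
(the dummy RDD and its size are illustrative):
val warmup = sc.parallelize(1 to 1000000)
warmup.persist()     // mark the dummy RDD for caching
warmup.count()       // first action: absorbs the initial scheduler delay
warmup.unpersist()   // drop the cached blocks again
warmup.count()       // second action: runs with the scheduler already warm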
BUG 1:
I have Maven 3.0.2 installed on the system. When I run make-distribution.sh,
it does not seem to use Maven 3.2.2 but instead uses /usr/local/bin/mvn to
build Spark. So I added the --force option in make-distribution.sh like this:
line 130:
VERSION=$("$MVN" --force help:evaluate -Dexpression=project.version $@
Thanks for the responses.
Fatma
On Apr 28, 2016 3:00 PM, "Renato Perini" wrote:
> I have setup a small development cluster using t2.micro machines and an
> Amazon Linux AMI (CentOS 6.x).
> The whole setup has been done manually, without using the provided
> scripts. The
Look, you said that you didn't have continuous data, and you do have continuous
data. I just used an analog signal as an example, which can be converted so that
you end up with contiguous digital sampling.
The point is that you have to consider that micro-batches are still batches, and
you're adding
I have set up a small development cluster using t2.micro machines and an
Amazon Linux AMI (CentOS 6.x).
The whole setup has been done manually, without using the provided
scripts. The whole setup is composed of a total of 5 instances: the
first machine has an elastic IP and it is used as a
Fatima, the easiest way to create a Spark cluster on AWS is to create an EMR
cluster and select the Spark application (the latest EMR includes Spark 1.6.1).
Spark works well with S3 (read and write). However, it's recommended to
set spark.speculation to true (it's expected that some tasks fail if you read
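A minimal sketch of setting that flag in Scala (standard Spark configuration
keys; the app name is illustrative):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("s3-job")
  .set("spark.speculation", "true")   // re-launch straggling tasks speculatively
val sc = new SparkContext(conf)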
What is your experience using Spark on AWS? Are you setting up your own
Spark cluster, and using HDFS? Or are you using Spark as a service from
AWS? In the latter case, what is your experience of using S3 directly,
without having HDFS in between?
Thanks,
Fatma
Hello Benjamin,
Have you taken a look at the slides of my talk at Strata San Jose?
http://www.slideshare.net/databricks/taking-spark-streaming-to-the-next-level-with-datasets-and-dataframes
Unfortunately there is no video, as Strata does not upload videos for
everyone.
I presented the same talk
Also, on the point about
"First there is this thing called analog signal processing…. Is that
continuous enough for you? "
I agree that an analog signal (a sine wave, an AM radio signal) is truly
continuous. However, here we are talking about digital data, which will always
be sent as
Hi Arun,
My bet is...https://spark-summit.org/2016 :)
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On Thu, Apr 28, 2016 at 1:43 PM, Arun Patel
Hello,
Where in the Spark APIs can I get access to the Hadoop Context instance? I
am trying to implement the Spark equivalent of this
public void reduce(Text key, Iterable values, Context context)
    throws IOException, InterruptedException {
  if (record == null) {
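There is no direct counterpart of the Hadoop Context object in the RDD API; a
rough sketch of the usual shape of an equivalent reduce, assuming pairs is an
existing RDD[(String, String)]:
val reduced = pairs.groupByKey().map { case (key, values) =>
  // 'values' plays the role of the reducer's Iterable; instead of
  // context.write(key, output) you simply return the output record
  (key, values.size)
}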
many thx Nick
kr
On Thu, Apr 28, 2016 at 8:07 PM, Nick Pentreath
wrote:
> This should work:
>
> scala> val df = Seq((25.0, "foo"), (30.0, "bar")).toDF("age", "name")
> scala> df.withColumn("AgeInt", when(col("age") > 29.0,
> 1).otherwise(0)).show
> +----+----+------+
This should work:
scala> val df = Seq((25.0, "foo"), (30.0, "bar")).toDF("age", "name")
scala> df.withColumn("AgeInt", when(col("age") > 29.0, 1).otherwise(0)).show
+----+----+------+
| age|name|AgeInt|
+----+----+------+
|25.0| foo|     0|
|30.0| bar|     1|
+----+----+------+
On Thu, 28 Apr 2016 at
Hi all,
I have a DataFrame with a column ("Age", type double) and I am trying to
create a new column based on the value of the Age column, using the Scala API.
This code keeps on complaining:
scala> df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else lit(0))
<console>:28: error: type mismatch;
found :
Hi,
I tried to convert a groupByKey operation to aggregateByKey in the hope of
avoiding memory and high GC issues when dealing with 200 GB of data.
I need to create a collection of resulting key-value pairs which
represents all combinations for a given key.
My merge function definition is as follows:
private
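For reference, a minimal sketch of the aggregateByKey shape being described,
assuming a pair RDD; the key and value types here are illustrative:
import scala.collection.mutable
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.aggregateByKey(mutable.ArrayBuffer.empty[Int])(
  (buf, v) => { buf += v; buf },   // seqOp: append within a partition
  (b1, b2) => { b1 ++= b2; b1 }    // combOp: merge buffers across partitions
)
Note that when the combine step only collects values, the memory profile stays
close to groupByKey; the real win comes when seqOp/combOp actually reduce the data.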
If your question is about how the schema is inferred for JSON,
section 5.1 of this paper
https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
explains it quite well (long story short, Spark tries to find
the most specific type for the field, otherwise it is a
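For a concrete illustration of that inference (Spark 1.x API; the path is
illustrative):
val df = sqlContext.read.json("people.json")   // one JSON record per line
df.printSchema()   // prints the most specific type inferred for each field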
In a commercial (C)EP engine like, say, StreamBase, or for example its
competitor Apama, the arrival of an input event **immediately** triggers further
downstream processing.
This is admittedly an asynchronous approach, not a synchronous clock-driven
micro-batch approach like Spark's.
I suppose if one
Thanks Dr. Mich, Jorn,
It's about 150 million rows in the cached dataset. How do I tell if it's
spilling to disk? I didn't really see any logs to that effect.
How do I determine the optimal number of partitions for a given input
dataset? What's too much?
regards,
imran
On Mon, Apr 25, 2016
Hi Imran,
" How do I tell if it's spilling to disk?"
Well, that is a very valid question. I do not have a quantitative metric to
use to state that out of X GB of data in Spark, Y GB has been spilled to
disk because of the volume of data.
Unlike an RDBMS, Spark uses memory as opposed to shared
From what I know and what I have played with, jsonFile reads JSON records
which are defined as one record per line. It's not always the case that you
can supply the data that way. If you have custom JSON data where you
cannot define a record per line, you will have to write your own
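A hedged sketch of that "write your own" route, assuming each file holds
several JSON records you can split yourself; the path and the splitting logic
are placeholders:
val raw = sc.wholeTextFiles("hdfs:///data/json-dir")   // (fileName, fileContent) pairs
val records = raw.flatMap { case (_, content) =>
  content.split("\n\n")   // placeholder: produce one JSON string per record
}
val df = sqlContext.read.json(records)   // read.json also accepts an RDD[String]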
I'm using ALS with mllib 1.5.2 in Scala.
I do not have access to the nonnegative flag in trainImplicit.
Which API is it available from?
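For what it's worth, a hedged sketch: the static trainImplicit helpers do not
expose that flag, but the ALS builder class in mllib does (ratings is assumed
to be an existing RDD[Rating]):
import org.apache.spark.mllib.recommendation.ALS
val model = new ALS()
  .setImplicitPrefs(true)
  .setRank(10)
  .setIterations(10)
  .setNonnegative(true)   // nonnegative least squares for the factor matrices
  .run(ratings)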
What happened when you tried to access port 8080?
Checking iptables settings is a good idea.
At my employer, we use OpenStack clusters daily and don't encounter many
problems, including with UI access.
Probably some settings need to be tuned.
On Thu, Apr 28, 2016 at 5:03 AM, Dan Dong
It implements CombineInputFormat from Hadoop. isSplittable=false means each
individual file cannot be split. If you only see one partition even with a
large minPartitions, perhaps the total size of files is not big enough.
Those are configurable in Hadoop conf. -Xiangrui
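Assuming the thread is about sc.wholeTextFiles (which uses a
CombineFileInputFormat with isSplittable = false), a small illustration:
val files = sc.wholeTextFiles("hdfs:///data/small-files", minPartitions = 100)
println(files.partitions.length)   // can still be well below 100 if the total input is small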
On Tue, Apr 26, 2016,
Interesting.
The phoenix dependency wasn't shown in the classpath of your previous email.
On Thu, Apr 28, 2016 at 4:12 AM, pierre lacave wrote:
> Narrowed down to some version incompatibility with Phoenix 4.7 ,
>
> Including
Next Thursday is Databricks' webinar on Spark 2.0. If you are attending, I bet
many are going to ask when the release will be. Last time they did this, Spark
1.6 came out not too long afterward.
> On Apr 28, 2016, at 5:21 AM, Sean Owen wrote:
>
> I don't know if anyone has
It is reading the files now but throws another error complaining that vector
sizes do not match. I saw this error reported on Stack Overflow:
http://stackoverflow.com/questions/30737361/getting-java-lang-illegalargumentexception-requirement-failed-while-calling-spa
Also example given in scala
I don't know if anyone has begun a firm discussion on dates, but there
are >100 open issues and ~10 blockers, so still some work to do before
code freeze, it looks like. My unofficial guess is mid June before
it's all done.
On Thu, Apr 28, 2016 at 12:43 PM, Arun Patel
Can someone explain to me how the new Structured Streaming works in the
upcoming Spark 2.0+? I'm a little hazy on how data will be stored and referenced
if it can be queried and/or batch processed directly from streams, and whether
the data will be append-only or whether there will be some sort of upsert
Hi, all,
I'm having a problem accessing the web UI of my Spark cluster. The cluster
is composed of a few virtual machines running on an OpenStack platform. The
VMs are launched from the CentOS 7.0 server image available from the official
site. Spark itself runs well, and the master and worker processes are all
I don’t.
I believe that there have been a couple of hack-a-thons like one done in
Chicago a few years back using public transportation data.
The first question is what sort of data do you get from the city?
I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y). Or
they
A small request.
Would you mind providing an approximate date of Spark 2.0 release? Is it
early May or Mid May or End of May?
Thanks,
Arun
Narrowed down to some version incompatibility with Phoenix 4.7.
Including $SPARK_HOME/lib/phoenix-4.7.0-HBase-1.1-client-spark.jar in
extraClassPath is what triggers the issue above.
I'll have a go at adding the individual dependencies as opposed to this fat
jar and see how it goes.
Thanks
Why would you use Java (create a problem and then try to solve it)? Have
you tried using Scala or Python or even R?
Regards,
Gourav
On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran
wrote:
>
> On 26 Apr 2016, at 18:49, Ted Yu wrote:
>
> Looking at
BTW, I have created a JIRA task to follow this issue:
https://issues.apache.org/jira/browse/SPARK-14974
2016-04-28 18:08 GMT+08:00 linxi zeng :
> Hi,
>
> Recently, we often encounter problems using spark sql for inserting data
> into a partition table (ex.: insert
The HBase 2.0 release would likely come after the Spark 2.0 release.
There are other features being developed in HBase 2.0;
I am not sure when HBase 2.0 will be released.
The reference guide is incomplete.
Zhan has assigned the doc JIRA to himself. The documentation will be done
after fixing bugs in
Hi,
Recently, we often encounter problems using Spark SQL to insert data
into a partitioned table (ex.: insert overwrite table $output_table
partition(dt) select xxx from tmp_table).
After the Spark job starts running on YARN, the app will create too many
files (e.g. 2,000,000+, or even 10,000,000+),
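One commonly suggested mitigation, not taken from this thread and heavily
simplified: reduce the number of tasks writing into each partition before the
insert (Spark 1.6+ DataFrame API with a HiveContext assumed; table and column
names are illustrative):
import org.apache.spark.sql.functions.col
sqlContext.table("tmp_table")
  .repartition(200, col("dt"))   // fewer, partition-aligned writing tasks => fewer files
  .registerTempTable("tmp_table_repart")
sqlContext.sql(
  "INSERT OVERWRITE TABLE output_table PARTITION (dt) SELECT * FROM tmp_table_repart")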
Thanks Ted,
I am actually using the Hadoop-free version of Spark
(spark-1.5.0-bin-without-hadoop) over Hadoop 2.6.1, so it could very well be
related indeed.
I have configured spark-env.sh with export
SPARK_DIST_CLASSPATH=$($HADOOP_PREFIX/bin/hadoop classpath), which is the
only version of Hadoop
Hi
I am a newbie in this analytics field. It seems there exist many
tools, frameworks, ecosystems, software packages, languages and so on.
1) Are there some classifications or groupings for them?
2) What kinds of tools exist?
3) What are the main purposes of these tools?
Regards
Do you know of any good examples of how to use Spark Streaming for tracking
public transportation systems?
Or an example with Storm or some other tool?
Regards
Esa Heikkinen
On 28.4.2016 at 3:16, Michael Segel wrote:
Uhm…
I think you need to clarify a couple of things…
First there is this thing called
Hi,
I wrote a Spark job which registers a temp table,
and when I expose it via beeline (the JDBC client):
$ ./bin/beeline
beeline> !connect jdbc:hive2://IP:10003 -n ram -p
0: jdbc:hive2://IP> show
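For readers following along, a hedged sketch of the usual pattern behind this
(Spark 1.x; the data source and table name are illustrative):
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
val hiveContext = new HiveContext(sc)
hiveContext.read.parquet("/data/events").registerTempTable("events")
// start a Thrift server sharing this context, so beeline can query "events"
HiveThriftServer2.startWithContext(hiveContext)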
We are hitting the same issue on Spark 1.6.1 with Tungsten enabled, Kryo
enabled and sort-based shuffle.
Did you find a resolution?
On Sat, Apr 9, 2016 at 6:31 AM, Ted Yu wrote:
> Not much.
>
> So no chance of different snappy version ?
>
> On Fri, Apr 8, 2016 at 1:26 PM,
Are you able to connect to the NameNode UI at MACHINE_IP:50070?
Check what the URI is there.
If the UI doesn't open, it means your HDFS is not up; try to start it using
start-dfs.sh.
On Thu, Apr 28, 2016 at 2:59 AM, Bibudh Lahiri
wrote:
> Hi,
> I installed Hadoop 2.6.0 today on