storage handler bulk load:
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT … ;
But for now, you have to do some work and issue multiple Hive commands:
- Sample source data for range partitioning
- Save sampling results to a file
- Run a CLUSTER BY query using HiveHFileOutputFormat and
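A rough HiveQL sketch of those steps (table names, column names, paths, and property values below are hypothetical, and the exact settings may differ by Hive version):

```sql
-- 1. Sample the source data to pick range-partition boundaries.
SELECT userid FROM users TABLESAMPLE(BUCKET 1 OUT OF 1000 ON userid) s;

-- 2. Save the sampled split points to a sequence file for
--    TotalOrderPartitioner (done outside Hive or with a second query).

-- 3. Write HFiles by clustering on the row key through
--    HiveHFileOutputFormat.
SET mapred.reduce.tasks = 12;
SET hive.mapred.partitioner = org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
SET total.order.partitioner.path = /tmp/hb_range_key_list;
INSERT OVERWRITE TABLE hbsort SELECT * FROM users CLUSTER BY userid;
```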
You can treat Oracle as a JDBC source (
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
and skip Sqoop and the Hive tables, going straight to queries. Then you can skip
Hive on the way back out (see the same link) and write directly to Oracle.
I'll leave the
On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu wrote:
> Hi Team,
>
> Right now our existing flow is:
>
> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (Hive
> Context) --> destination Hive table --> Sqoop export to Oracle
>
> Half of the required Hive UDFs are developed
Hi Team,
Right now our existing flow is:
Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (Hive
Context) --> destination Hive table --> Sqoop export to Oracle
Half of the required Hive UDFs are developed as Java UDFs.
So now I want to know whether running the native Scala UDFs rather than running the Hive
Java
In June, the 10th Spark Summit will take place in San Francisco at Moscone
West. We have expanded our CFP to include more topics and deep-dive
technical sessions.
Take center stage in front of your fellow Spark enthusiasts. Submit your
presentation and join us for the big ten. The CFP closes on
Not sure what you mean by "a consistency layer on top." Any explanation would
be greatly appreciated!
Paul
_
Paul Tremblay
Analytics Specialist
THE BOSTON CONSULTING GROUP
Yeah, Kafka server/client compatibility can be pretty confusing and does
not give good errors in the case of mismatches. This should be addressed
in the next release of Kafka (they are adding an API to query the server's
capabilities).
On Fri, Jan 27, 2017 at 12:56 PM, Koert Kuipers
What about Spark on Kubernetes, is there a way to manage dynamic resource allocation?
Regards,
Mihai Iacob
in case anyone else runs into this:
the issue is that i was using kafka-clients 0.10.1.1
it works when i use kafka-clients 0.10.0.1 with spark structured streaming
my kafka server is 0.10.1.1
On Fri, Jan 27, 2017 at 1:24 PM, Koert Kuipers wrote:
> i checked my topic. it
Thank you, Richard, for responding.
I am able to run it successfully by using row.getMap, but since I have to
update the map I wanted to use the HashMap API. Is there a way I can use
that? And I am surprised that it worked in the first case, where I am creating a
Dataset from a list of rows, but fails with the Map
try
Row newRow = RowFactory.create(row.getString(0), row.getString(1),
row.getMap(2));
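For what it's worth, the immutable map handed back by row.getMap can simply be copied into a fresh java.util.HashMap before updating. A Spark-free sketch of that copy-then-mutate pattern (the sample keys and values are made up):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class MapCopyDemo {
    public static void main(String[] args) {
        // Stand-in for the immutable map returned by row.getMap(2):
        Map<String, String> fromRow = Collections.unmodifiableMap(
                new HashMap<>(Map.of("city", "Boston")));

        // Copy the entries into a fresh HashMap that can be updated freely.
        Map<String, String> updatable = new HashMap<>(fromRow);
        updatable.put("state", "MA");

        System.out.println(updatable.get("state")); // MA
    }
}
```

The updated map can then go back into a new Row the same way as in the snippet above.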
On Friday, January 27, 2017 10:52 AM, Ankur Srivastava
wrote:
+ DEV Mailing List
On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava
wrote:
> The way I understand is that the Spark job will not run if the CPU/Mem
> requirement is not met.
Spark jobs will still run if they only have a subset of the requested
resources. Tasks begin scheduling as soon as the first executor comes up.
Dynamic allocation yields increased utilization by only
+ DEV Mailing List
On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:
> Hi,
>
> I am trying to map a Dataset with rows which have a map attribute. When I
> try to create a Row with the map attribute I get cast errors. I am able to
> reproduce the issue with
Thanks for reporting this. Which Spark version are you using? Could you
provide the full log, please?
On Fri, Jan 27, 2017 at 10:24 AM, Koert Kuipers wrote:
> i checked my topic. it has 5 partitions but all the data is written to a
> single partition: wikipedia-2
> i turned
i checked my topic. it has 5 partitions but all the data is written to a
single partition: wikipedia-2
i turned on debug logging and i see this:
2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Partitions assigned to
consumer: [wikipedia-0, wikipedia-4, wikipedia-3, wikipedia-2,
wikipedia-1].
code:
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "somenode:9092")
  .option("subscribe", "wikipedia")
  .load
  .select(col("value") cast StringType)
  .writeStream
  .format("console")
I'm reading CSV with a timestamp clearly identified in the UTC timezone,
and I need to store this in a parquet format and eventually read it back
and convert to different timezones as needed.
Sounds straightforward, but this involves some crazy function calls and I'm
seeing strange results as I
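For the parsing half of this, plain java.time handles the UTC-in, other-zone-out round trip predictably. A small sketch (the sample timestamp and target zone are made up):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class UtcTimestampDemo {
    public static void main(String[] args) {
        // A CSV field like "2017-01-27T13:02:50Z" (hypothetical sample value).
        // Instant is zone-free, which is the right shape to persist.
        Instant utc = Instant.parse("2017-01-27T13:02:50Z");

        // Convert to a target zone only when reading the data back.
        ZonedDateTime pacific = utc.atZone(ZoneId.of("America/Los_Angeles"));
        System.out.println(pacific.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME));
        // 2017-01-27T05:02:50-08:00
    }
}
```

Keeping the stored value as a zone-free instant and converting only at the edges is usually what avoids the "strange results".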
Dear Spark Users,
Is there currently a way to dynamically allocate resources to Spark on
Mesos? Within Spark we can specify the CPU cores and memory before running a
job. The way I understand it, the Spark job will not run if the CPU/Mem
requirement is not met. This may lead to a decrease in overall
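For what it's worth, dynamic allocation can be switched on with settings along these lines (a sketch; the executor counts are made-up examples, and the external shuffle service must also be running on each agent):

```properties
spark.dynamicAllocation.enabled      true
spark.shuffle.service.enabled        true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 20
```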
It should be nullable by default, except for certain primitives where it
defaults to non-nullable.
You can use Option for your return value to indicate nullability.
On Fri, Jan 27, 2017 at 10:32 AM, Ninad Shringarpure
wrote:
> HI Team,
>
> When I add a column to my data frame
Hi All,
I am trying to cache a large dataset with a memory storage level and serialization
with Kryo enabled. When I run my Spark job multiple times I get different
performance; at times, while caching the dataset, Spark hangs and takes forever. What is
wrong?
The best time I got is 20 mins and some times
Hi Team,
When I add a column to my data frame using withColumn and assign some
value, it automatically creates the schema with this column as not
nullable.
My final Hive table schema, where I want to insert it, has this column as
nullable, and hence it throws an error when I try to save.
Is
Sorry the message was not complete: the key is the file position, so if you
sort by key the lines will be in the same order as in the original file
> On 27 Jan 2017, at 14:45, Jörn Franke wrote:
>
> I agree with the previous statements. You cannot expect any ordering
>
I agree with the previous statements. You cannot expect any ordering guarantee.
This means you need to ensure yourself that the ordering matches the original
file. Internally, Spark uses the Hadoop client libraries - even if you do
not have Hadoop installed, because it is a flexible
I would not count on the order-preserving nature of the operations, because it
is not guaranteed. I would assign some order to the sentences and sort at
the end before writing back
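That approach can be sketched without a cluster: pair each line with its index, let the order get scrambled, and sort by the carried key at the end. A plain-Java stand-in for zipWithIndex followed by a key sort (the sample lines are made up):

```java
import java.util.*;
import java.util.stream.Collectors;

public class OrderRestoreDemo {
    public static void main(String[] args) {
        List<String> original = List.of("first", "second", "third", "fourth");

        // Pair each line with its position (what zipWithIndex gives on an RDD).
        List<Map.Entry<Integer, String>> keyed = new ArrayList<>();
        for (int i = 0; i < original.size(); i++) {
            keyed.add(Map.entry(i, original.get(i)));
        }

        // Simulate the reordering a distributed transformation may introduce.
        Collections.shuffle(keyed, new Random(42));

        // Sorting by the carried key restores the original order.
        List<String> restored = keyed.stream()
                .sorted(Map.Entry.comparingByKey())
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());

        System.out.println(restored.equals(original)); // true
    }
}
```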
On Fri, 27 Jan 2017 at 10:59 pm, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:
> Some operations like map,
Hi - thanks for the responses. You are right that I started by copying the
word-counting example. I assumed that this would help spread the load
evenly across the cluster, with each worker receiving a portion of the
stream data - corresponding to one shard's worth - and then keeping the
data local
Some operations like map, filter, flatMap and coalesce (with shuffle=false)
usually preserve the order. However, sortBy, reduceBy, partitionBy, join
etc. do not.
Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National
OK
Nobody should be committing output directly to S3 without having something add
a consistency layer on top, not if you want reliable (as in "doesn't
lose/corrupt data" reliable) work
On 26 Jan 2017, at 19:09, VND Tremblay, Paul
> wrote:
Probably, he referred to the word-counting example in Kinesis here:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L114
On Fri, Jan 27, 2017 at 6:41 PM, ayan guha wrote:
> Maybe a
Hi All,
I read a text file using sparkContext.textFile(filename), assign it to
an RDD, process the RDD (replacing some words), and finally write it to
a text file using rdd.saveAsTextFile(output).
Is there any way to be sure the order of the sentences will not be changed?
I need to have the
Maybe a naive question: why are you creating one DStream per shard? It should
be one DStream corresponding to the Kinesis stream, shouldn't it?
On Fri, Jan 27, 2017 at 8:09 PM, Takeshi Yamamuro
wrote:
> Hi,
>
> Just a guess though, Kinesis shards sometimes have skew data.
> So,
Hi,
Just a guess, though: Kinesis shards sometimes have skewed data.
So, before you compute something from Kinesis RDDs, it is better to
repartition them for better parallelism.
// maropu
On Fri, Jan 27, 2017 at 2:54 PM, Graham Clark wrote:
> Hi everyone - I am building a
IIUC, if the references to RDDs are gone, the related files (e.g.,
shuffled data) of these
RDDs are automatically removed by `ContextCleaner` (
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L178
).
Since Spark can recompute from data sources
Not outdated at all: there are other methods that depend on
SparkContext, so you still have to create it.
For example,
https://gist.github.com/chetkhatri/f75c2b743e6cb2d7066188687448c5a1
On Fri, Jan 27, 2017 at 2:06 PM, Wojciech Indyk
wrote:
> Hi!
> In this doc
Hi!
In this doc
http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark
initialization is described using SparkContext. Do you think it is reasonable
to change it to SparkSession, or just mention it at the end? I can prepare
it and make a PR for this, but want to know your