Re: HBaseContext with Spark

2017-01-27 Thread Chetan Khatri
storage handler bulk load: SET hive.hbase.bulk=true; INSERT OVERWRITE TABLE users SELECT … ; But for now, you have to do some work and issue multiple Hive commands: sample source data for range partitioning, save sampling results to a file, run a CLUSTER BY query using HiveHFileOutputFormat and
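
A rough sketch of the programmatic route the thread subject points at, assuming the hbase-spark module's HBaseContext.bulkLoad API (the table name, column family, staging dir, and sample RDD are all illustrative):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.spark.{HBaseContext, KeyFamilyQualifier}
    import org.apache.hadoop.hbase.util.Bytes

    // Build HFiles from an RDD into a staging dir, which a subsequent
    // LoadIncrementalHFiles run can move into the table.
    val rdd = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
    val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
    hbaseContext.bulkLoad[(String, String)](rdd,
      TableName.valueOf("users"),
      { case (rowKey, value) =>
        Iterator((new KeyFamilyQualifier(Bytes.toBytes(rowKey),
          Bytes.toBytes("cf"), Bytes.toBytes("col")), Bytes.toBytes(value)))
      },
      "/tmp/hfile-staging")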

Re: spark architecture question -- Please Read

2017-01-27 Thread Russell Spitzer
You can treat Oracle as a JDBC source ( http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) and skip Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the way back out (see the same link) and write directly to Oracle. I'll leave the
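
A minimal sketch of that round trip, assuming a reachable Oracle instance with the JDBC driver on the classpath (URL, credentials, tables, and the stand-in aggregation are illustrative):

    val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"

    // Read straight from Oracle instead of landing the data via Sqoop.
    val src = spark.read.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "SCOTT.EMP")
      .option("user", "scott")
      .option("password", "tiger")
      .load()

    val result = src.groupBy("DEPTNO").count()  // stand-in for the real queries

    // Write the result straight back to Oracle, skipping Hive on the way out.
    result.write.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "SCOTT.EMP_SUMMARY")
      .option("user", "scott")
      .option("password", "tiger")
      .save()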

Re: spark architecture question -- Please Read

2017-01-27 Thread Sirisha Cheruvu
On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu wrote: > Hi Team, > > Right now our existing flow is > > Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> destination Hive table --> Sqoop export to Oracle > > Half of the Hive UDFs required is developed

spark architecture question -- Please Read

2017-01-27 Thread Sirisha Cheruvu
Hi Team, Right now our existing flow is Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> destination Hive table --> Sqoop export to Oracle. Half of the Hive UDFs required are developed as Java UDFs. So now I want to know: if I run native Scala UDFs rather than running the Hive Java
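
If it helps, registering a native Scala UDF for use from Spark SQL is a one-liner and can coexist with Hive UDFs when Hive support is enabled; a minimal sketch (function name, logic, and table are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    // Register a Scala function under a SQL-callable name.
    spark.udf.register("normalize",
      (s: String) => if (s == null) null else s.trim.toLowerCase)

    spark.sql("SELECT normalize(name) FROM source_table").show()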

CFP for Spark Summit San Francisco closes on Feb. 6

2017-01-27 Thread Scott walent
In June, the 10th Spark Summit will take place in San Francisco at Moscone West. We have expanded our CFP to include more topics and deep-dive technical sessions. Take center stage in front of your fellow Spark enthusiasts. Submit your presentation and join us for the big ten. The CFP closes on

RE: spark 2.02 error when writing to s3

2017-01-27 Thread VND Tremblay, Paul
Not sure what you mean by "a consistency layer on top." Any explanation would be greatly appreciated! Paul Tremblay, Analytics Specialist, The Boston Consulting Group

Re: kafka structured streaming source refuses to read

2017-01-27 Thread Michael Armbrust
Yeah, Kafka server/client compatibility can be pretty confusing and does not give good errors in the case of mismatches. This should be addressed in the next release of Kafka (they are adding an API to query the server's capabilities). On Fri, Jan 27, 2017 at 12:56 PM, Koert Kuipers

Re: Dynamic resource allocation to Spark on Mesos

2017-01-27 Thread Mihai Iacob
What about Spark on Kubernetes, is there a way to manage dynamic resource allocation? Regards, Mihai Iacob

Re: kafka structured streaming source refuses to read

2017-01-27 Thread Koert Kuipers
In case anyone else runs into this: the issue is that I was using kafka-clients 0.10.1.1. It works when I use kafka-clients 0.10.0.1 with Spark Structured Streaming. My Kafka server is 0.10.1.1. On Fri, Jan 27, 2017 at 1:24 PM, Koert Kuipers wrote: > I checked my topic. It
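
For anyone matching dependencies, a hedged build.sbt sketch of the combination that worked here (the Spark artifact version is an assumption for a 2.1-era build):

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.0",
      "org.apache.kafka" %  "kafka-clients"        % "0.10.0.1"
    )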

Re: Issue creating row with java.util.Map type

2017-01-27 Thread Ankur Srivastava
Thank you, Richard, for responding. I am able to run it successfully by using row.getMap, but since I have to update the map I wanted to use the HashMap API. Is there a way I can use that? And I am surprised that it worked in the first case, where I am creating a Dataset from a list of rows, but fails in the Map

Re: Issue creating row with java.util.Map type

2017-01-27 Thread Richard Xin
Try Row newRow = RowFactory.create(row.getString(0), row.getString(1), row.getMap(2)); On Friday, January 27, 2017 10:52 AM, Ankur Srivastava wrote: + DEV Mailing List On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava wrote:
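
Since Rows are immutable, one hedged way to "update" the map is to copy it and build a new Row; a sketch in Scala, assuming df is the Dataset of Rows in question (field positions and the added entry are illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    // Copy the map, add an entry, and emit a new Row with the same schema.
    val updated = df.map { row =>
      val m = row.getMap[String, String](2).toMap + ("newKey" -> "newValue")
      Row(row.getString(0), row.getString(1), m)
    }(RowEncoder(df.schema))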

Re: Dynamic resource allocation to Spark on Mesos

2017-01-27 Thread Michael Gummelt
> The way I understand it is that the Spark job will not run if the CPU/Mem requirement is not met. Spark jobs will still run if they only have a subset of the requested resources. Tasks begin scheduling as soon as the first executor comes up. Dynamic allocation yields increased utilization by only
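
The settings involved, as a sketch (values illustrative); note that dynamic allocation on Mesos also needs the external shuffle service running on each agent:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")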

Re: Issue creating row with java.util.Map type

2017-01-27 Thread Ankur Srivastava
+ DEV Mailing List On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava < ankur.srivast...@gmail.com> wrote: > Hi, > > I am trying to map a Dataset with rows which have a map attribute. When I > try to create a Row with the map attribute I get cast errors. I am able to > reproduce the issue with

Re: kafka structured streaming source refuses to read

2017-01-27 Thread Shixiong(Ryan) Zhu
Thanks for reporting this. Which Spark version are you using? Could you provide the full log, please? On Fri, Jan 27, 2017 at 10:24 AM, Koert Kuipers wrote: > i checked my topic. it has 5 partitions but all the data is written to a > single partition: wikipedia-2 > i turned

Re: kafka structured streaming source refuses to read

2017-01-27 Thread Koert Kuipers
I checked my topic. It has 5 partitions but all the data is written to a single partition: wikipedia-2. I turned on debug logging and I see this: 2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Partitions assigned to consumer: [wikipedia-0, wikipedia-4, wikipedia-3, wikipedia-2, wikipedia-1].

Re: kafka structured streaming source refuses to read

2017-01-27 Thread Koert Kuipers
code:

    val query = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "somenode:9092")
      .option("subscribe", "wikipedia")
      .load
      .select(col("value") cast StringType)
      .writeStream
      .format("console")

Converting timezones in Spark

2017-01-27 Thread Don Drake
I'm reading CSV with a timestamp clearly identified in the UTC timezone, and I need to store this in a parquet format and eventually read it back and convert to different timezones as needed. Sounds straightforward, but this involves some crazy function calls and I'm seeing strange results as I
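
For reference, the built-in functions involved look roughly like this (column names and zone are illustrative; note that how a timestamp renders also depends on the session/JVM timezone):

    import org.apache.spark.sql.functions.{col, from_utc_timestamp}

    // Parse the CSV, treat the parsed timestamp as UTC, and materialize a
    // second column rendered in another zone before writing Parquet.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("events.csv")

    df.withColumn("ts_pacific",
        from_utc_timestamp(col("ts_utc"), "America/Los_Angeles"))
      .write.parquet("events.parquet")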

Dynamic resource allocation to Spark on Mesos

2017-01-27 Thread Ji Yan
Dear Spark Users, Is there currently a way to dynamically allocate resources to Spark on Mesos? Within Spark we can specify the CPU cores and memory before running a job. The way I understand it is that the Spark job will not run if the CPU/Mem requirement is not met. This may lead to a decrease in overall

Re: Making withColumn nullable

2017-01-27 Thread Koert Kuipers
It should be nullable by default, except for certain primitives where it defaults to non-nullable. You can use Option for your return value to indicate nullability. On Fri, Jan 27, 2017 at 10:32 AM, Ninad Shringarpure wrote: > Hi Team, > > When I add a column to my data frame
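
A small sketch of the Option approach (column names and the function are illustrative):

    import org.apache.spark.sql.functions.{col, udf}

    // Returning Option[Double] marks the generated column as nullable.
    val safeDiv = udf((a: Double, b: Double) =>
      if (b == 0.0) None else Some(a / b))

    val withRatio = df.withColumn("ratio", safeDiv(col("num"), col("den")))
    withRatio.schema("ratio").nullable  // true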

Issue with caching

2017-01-27 Thread Anil Langote
Hi All, I am trying to cache a large dataset with memory storage level and serialization, with Kryo enabled. When I run my Spark job multiple times I get different performance; at times, while caching the dataset, Spark hangs and takes forever. What is wrong? The best time I got is 20 mins and sometimes
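
For reference, a sketch of serialized caching with Kryo, assuming rdd is the dataset being cached (the registered class is a hypothetical stand-in for your record type):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    // Kryo must be configured before the SparkContext is created.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))  // hypothetical class

    // MEMORY_ONLY_SER stores serialized bytes: slower to read, smaller.
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()  // materialize the cache once before the repeated queries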

Making withColumn nullable

2017-01-27 Thread Ninad Shringarpure
Hi Team, When I add a column to my data frame using withColumn and assign some value, it automatically creates the schema with this column as not nullable. My final Hive table schema, where I want to insert it, has this column as nullable, and hence it throws an error when I try to save. Is

Re: Text

2017-01-27 Thread Jörn Franke
Sorry, the message was not complete: the key is the file position, so if you sort by key the lines will be in the same order as in the original file. > On 27 Jan 2017, at 14:45, Jörn Franke wrote: > > I agree with the previous statements. You cannot expect any ordering
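
A sketch of that idea through the Hadoop API, where the record key is the byte offset of each line (paths and the transformation are illustrative):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // Keys are byte offsets into the file; convert the Writables to plain
    // types before shuffling, transform, then restore the file order.
    val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat]("input.txt")
      .map { case (offset, text) => (offset.get, text.toString) }

    lines.mapValues(_.replace("foo", "bar"))
      .sortByKey()
      .values
      .saveAsTextFile("output")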

Re: Text

2017-01-27 Thread Jörn Franke
I agree with the previous statements. You cannot expect any ordering guarantee. This means you need to ensure yourself that the ordering matches the original file. Internally, Spark uses the Hadoop client libraries - even if you do not have Hadoop installed, because it is a flexible

Re: Text

2017-01-27 Thread ayan guha
I would not count on the order-preserving nature of the operations, because it is not guaranteed. I would assign some order to the sentences and sort at the end before writing back. On Fri, 27 Jan 2017 at 10:59 pm, Md. Rezaul Karim <rezaul.ka...@insight-centre.org> wrote: > Some operations like map,
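
A sketch of that approach with zipWithIndex (paths and the transformation are illustrative):

    // Pin an explicit order up front, process, then sort on it at the end.
    val indexed = sc.textFile("input.txt")
      .zipWithIndex()                                  // (line, index)
      .map { case (line, i) => (i, line.replace("foo", "bar")) }

    indexed.sortByKey().values.saveAsTextFile("output")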

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread Graham Clark
Hi - thanks for the responses. You are right that I started by copying the word-counting example. I assumed that this would help spread the load evenly across the cluster, with each worker receiving a portion of the stream data - corresponding to one shard's worth - and then keeping the data local

Re: Text

2017-01-27 Thread Md. Rezaul Karim
Some operations like map, filter, flatMap and coalesce (with shuffle=false) usually preserve the order. However, sortBy, reduceByKey, partitionBy, join etc. do not. Regards, *Md. Rezaul Karim*, BSc, MSc, PhD Researcher, INSIGHT Centre for Data Analytics, National

Re: spark 2.02 error when writing to s3

2017-01-27 Thread Steve Loughran
OK. Nobody should be committing output directly to S3 without having something add a consistency layer on top, not if you want reliable (as in "doesn't lose/corrupt data" reliable) work. On 26 Jan 2017, at 19:09, VND Tremblay, Paul wrote:

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread Takeshi Yamamuro
Probably, he referred to the word-counting example in kinesis here: https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L114 On Fri, Jan 27, 2017 at 6:41 PM, ayan guha wrote: > Maybe a
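
The pattern in that example, roughly: one receiver per shard, unioned into a single DStream. A hedged sketch assuming an existing StreamingContext ssc (app/stream names, region, and counts are illustrative):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val numShards = 5
    val streams = (0 until numShards).map { _ =>
      KinesisUtils.createStream(ssc, "myApp", "myStream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, Seconds(10),
        StorageLevel.MEMORY_AND_DISK_2)
    }
    val unified = ssc.union(streams)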

Text

2017-01-27 Thread Soheila S.
Hi All, I read a text file using sparkContext.textFile(filename), assign it to an RDD, process the RDD (replace some words), and finally write it to a text file using rdd.saveAsTextFile(output). Is there any way to be sure the order of the sentences will not be changed? I need to have the

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread ayan guha
Maybe a naive question: why are you creating one DStream per shard? It should be one DStream corresponding to the Kinesis stream, shouldn't it? On Fri, Jan 27, 2017 at 8:09 PM, Takeshi Yamamuro wrote: > Hi, > > Just a guess though, Kinesis shards sometimes have skewed data. > So,

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread Takeshi Yamamuro
Hi, Just a guess though: Kinesis shards sometimes have skewed data, so before you compute something from Kinesis RDDs, you'd better repartition them for better parallelism. // maropu On Fri, Jan 27, 2017 at 2:54 PM, Graham Clark wrote: > Hi everyone - I am building a
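
A one-line sketch of that suggestion, assuming kinesisStream is the input DStream (target partition count illustrative):

    // Rebalance skewed shard data before the heavy computation.
    val rebalanced = kinesisStream.repartition(16)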

Re: spark intermediate data fills up the disk

2017-01-27 Thread Takeshi Yamamuro
IIUC, once the references to RDDs are gone, the related files (e.g., shuffled data) of those RDDs are automatically removed by `ContextCleaner` (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L178). Since Spark can recompute from data sources
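
A sketch of nudging that cleanup along when the disk fills up (the interval value is illustrative; spark.cleaner.periodicGC.interval defaults to 30min):

    // Unpersist cached RDDs you are done with so ContextCleaner can act,
    // and optionally make the driver GC (which triggers cleanup) run more often.
    val rdd = sc.parallelize(1 to 100).cache()
    rdd.count()
    rdd.unpersist()   // release the cached blocks right away

    // spark-defaults.conf (or SparkConf):
    // spark.cleaner.periodicGC.interval  10min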

Re: outdated documentation? SparkSession

2017-01-27 Thread Chetan Khatri
Not outdated at all: there are other methods with dependencies on SparkContext, so you still have to create it. For example, https://gist.github.com/chetkhatri/f75c2b743e6cb2d7066188687448c5a1 On Fri, Jan 27, 2017 at 2:06 PM, Wojciech Indyk wrote: > Hi! > In this doc
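
For completeness, the SparkSession entry point still exposes the SparkContext when an API needs one; a minimal sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("example").getOrCreate()
    val sc = spark.sparkContext          // for RDD-level APIs
    val rdd = sc.parallelize(1 to 10)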

outdated documentation? SparkSession

2017-01-27 Thread Wojciech Indyk
Hi! In this doc http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark initialization is described via SparkContext. Do you think it is reasonable to change it to SparkSession, or just mention it at the end? I can prepare it and make a PR for this, but want to know your