Re: Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Diego Fanesi
That variable "x" would be a DataFrame, which is an alias of Dataset[Row] in recent versions. You can do your map operation with x.map { case Row(f1: String, f2: Int, ...) => [your code] }. f1 and f2 stand for the columns of your dataset, with their types; inside the code you can use f1 and f2 as variables.
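A minimal sketch of that pattern (the column names, types, and sample data here are assumptions for illustration, not from the thread):

    import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder.master("local[*]").appName("row-map-sketch").getOrCreate()
    import spark.implicits._

    // A DataFrame, i.e. Dataset[Row], with assumed columns f1 and f2
    val x = Seq(("alice", 1), ("bob", 2)).toDF("f1", "f2")

    // Pattern-match each Row into typed values and use them as ordinary variables
    val mapped = x.map { case Row(f1: String, f2: Int) => s"$f1 -> ${f2 * 2}" }
    mapped.show()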

Re: Best way to deal with skewed partition sizes

2017-03-22 Thread Ryan
Could you share the event timeline and DAG for the time-consuming stages from the Spark UI? On Thu, Mar 23, 2017 at 4:30 AM, Matt Deaver wrote: > For various reasons, our data set is partitioned in Spark by customer id > and saved to S3. When trying to read this data, however,

Re: Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Diego Fanesi
You are using Spark as a library, but it is much more than that. The book "Learning Spark" is very well done and it helped me a lot when starting with Spark. Maybe you should start from there. These are the issues in your code: basically, you generally don't execute Spark code like that. You could

Re: Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Keith Chapman
Thanks for the advice Diego, that was very helpful. How could I read the CSV as a Dataset though? I need to do a map operation over the Dataset; I just coded up an example to illustrate the issue. On Mar 22, 2017 6:43 PM, "Diego Fanesi" wrote: > You are using Spark as a
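A hedged sketch of one way this can be done in Spark 2.1 (the file path, options, and the Person case class are assumptions): read the CSV as a DataFrame and convert it with .as[...] so the map runs over a typed Dataset:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)   // assumed to match the CSV columns

    val spark = SparkSession.builder.master("local[*]").appName("csv-to-dataset").getOrCreate()
    import spark.implicits._

    // DataFrame -> Dataset[Person]; column names and types must line up with the case class
    val people = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("src/test/resources/people.csv")
      .as[Person]

    // The map operation over the typed Dataset
    val labels = people.map(p => s"${p.name} is ${p.age}")
    labels.show()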

Converting dataframe to dataset question

2017-03-22 Thread shyla deshpande
Why is userDS Dataset[Any] instead of Dataset[Teamuser]? Appreciate your help. Thanks. val spark = SparkSession .builder .config("spark.cassandra.connection.host", cassandrahost) .appName(getClass.getSimpleName) .getOrCreate() import spark.implicits._ val
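For reference, a hedged sketch of one common way to end up with Dataset[Teamuser] rather than Dataset[Any]: define Teamuser as a top-level case class whose fields match the table's columns and convert the DataFrame with .as[Teamuser] (the fields, host, keyspace, and table below are assumptions, and spark-cassandra-connector is assumed to be on the classpath):

    import org.apache.spark.sql.SparkSession

    case class Teamuser(teamid: String, userid: String, role: String)   // assumed fields

    val spark = SparkSession.builder
      .master("local[*]")
      .config("spark.cassandra.connection.host", "127.0.0.1")           // assumed host
      .appName("df-to-typed-ds")
      .getOrCreate()
    import spark.implicits._

    val userDF = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "test", "table" -> "teamuser"))        // assumed keyspace/table
      .load()

    val userDS = userDF.as[Teamuser]   // Dataset[Teamuser], not Dataset[Any]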

Re: calculate diff of value and median in a group

2017-03-22 Thread ayan guha
For the median, use percentile_approx with 0.5 (the 50th percentile is the median). On Thu, Mar 23, 2017 at 11:01 AM, Yong Zhang wrote: > He is looking for median, not mean/avg. > > > You have to implement the median logic by yourself, as there is no > direct implementation in
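A hedged sketch of how that can be wired up (column names and sample data are assumptions): compute the approximate median per group with percentile_approx(value, 0.5), then join it back to get each row's diff from its group median:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, expr}

    val spark = SparkSession.builder.master("local[*]").appName("median-diff").getOrCreate()
    import spark.implicits._

    val t = Seq(
      ("a", "x", "n1", 1.0),
      ("a", "x", "n2", 3.0),
      ("a", "x", "n3", 10.0)
    ).toDF("group1", "group2", "name", "value")

    // percentile_approx at 0.5 = approximate median per group
    val medians = t.groupBy("group1", "group2")
      .agg(expr("percentile_approx(value, 0.5)").as("median"))

    // Join back and compute each row's diff from its group median
    val withDiff = t.join(medians, Seq("group1", "group2"))
      .withColumn("diff", col("value") - col("median"))
    withDiff.show()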

Re: calculate diff of value and median in a group

2017-03-22 Thread Yong Zhang
He is looking for the median, not the mean/avg. You have to implement the median logic yourself, as there is no direct implementation in Spark. You can use the RDD API if you are on 1.6.x, or the Dataset API if you are on 2.x. The following example gives you an idea of how to calculate the median using the Dataset

Re: calculate diff of value and median in a group

2017-03-22 Thread ayan guha
I would suggest using a window function with partitioning: select group1, group2, name, value, avg(value) over (partition by group1, group2 order by name) m from t On Thu, Mar 23, 2017 at 9:58 AM, Craig Ching wrote: > Is the element count per group big? If not, you can group them

Re: calculate diff of value and median in a group

2017-03-22 Thread Craig Ching
> Is the element count per group big? If not, you can group them and use the > code to calculate the median and diff. > > > They're not big, no. Any pointers on how I might do that? The part I'm having trouble with is the grouping; I can't seem to see how to do the median per group. For

Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Keith Chapman
Hi, I'm trying to read a CSV file into a Dataset but keep having compilation issues. I'm using Spark 2.1, and the following is a small program that exhibits the issue I'm having. I've searched around but not found a solution that worked; I've added "import sqlContext.implicits._" as suggested

Re: calculate diff of value and median in a group

2017-03-22 Thread Yong Zhang
Is the element count per group big? If not, you can group them and use the code to calculate the median and diff. Yong From: Craig Ching Sent: Wednesday, March 22, 2017 3:17 PM To: user@spark.apache.org Subject: calculate diff of value

Re: Custom Spark data source in Java

2017-03-22 Thread Jörn Franke
OK, I understand. For 1), as a minimum you need to implement inferSchema and buildReader. inferSchema must return the schema of a row. For example, if it contains one column of type String, it returns: StructType(collection.immutable.Seq(StructField("column1", StringType, true))). buildReader: here

Re: Custom Spark data source in Java

2017-03-22 Thread Jean Georges Perrin
Thanks Jörn, I tried to simplify my project as much as possible so I can focus on the plumbing, and I will add the existing code & library later. So, as of now, the project will not have a lot of meaning but will allow me to understand the job. My call is: String filename = "src/test/resources/simple.json";

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Cody Koeninger
If you're talking about reading the same message multiple times in a failure situation, see https://github.com/koeninger/kafka-exactly-once. If you're talking about producing the same message multiple times in a failure situation, keep an eye on

Best way to deal with skewed partition sizes

2017-03-22 Thread Matt Deaver
For various reasons, our data set is partitioned in Spark by customer id and saved to S3. When trying to read this data, however, the larger partitions make it difficult to parallelize jobs. For example, out of a couple thousand companies, some have <10 MB of data while some have >10 GB. This is the

Re: Custom Spark data source in Java

2017-03-22 Thread Jörn Franke
I think you can develop a Spark data source in Java, but you are right that most use Scala for the glue to Spark even if they have a Java library (this is what I did for the project I open-sourced). Coming back to your question, it is a little bit difficult to assess the exact issue without the code. You

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Matt Deaver
You have to handle de-duplication upstream or downstream. It might technically be possible to handle this in Spark but you'll probably have a better time handling duplicates in the service that reads from Kafka. On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart wrote: >

Spark streaming to kafka exactly once

2017-03-22 Thread Maurin Lenglart
Hi, we are trying to build a Spark Streaming solution that subscribes to and pushes to Kafka. But we are running into the problem of duplicate events. Right now, I am doing a "foreachRDD" and looping over the messages of each partition and sending those messages to Kafka. Is there any good way of solving
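For context, a hedged sketch of the pattern described (the broker address, topic, and DStream[String] input are assumptions): foreachRDD, then one producer per partition sending each record to Kafka. A batch replayed after a failure will re-send its records, which is where the duplicates come from:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    def writeToKafka(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val props = new Properties()
          props.put("bootstrap.servers", "broker:9092")   // assumed broker
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          // Re-running this after a failed batch re-sends the same messages (at-least-once)
          records.foreach(msg => producer.send(new ProducerRecord[String, String]("output-topic", msg)))
          producer.close()
        }
      }
    }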

Custom Spark data source in Java

2017-03-22 Thread Jean Georges Perrin
Hi, I am trying to build a custom file data source for Spark, in Java. I have found numerous examples in Scala (including the CSV and XML data sources from Databricks), but I cannot bring Scala into this project. We also already have the parser itself written in Java; I just need to build the

calculate diff of value and median in a group

2017-03-22 Thread Craig Ching
Hi, when using PySpark, I'd like to be able to calculate the difference between grouped values and their median for the group. Is this possible? Here is some code I hacked up that does what I want, except that it calculates the grouped diff from the mean. Also, please feel free to comment on how I

[ Spark Streaming & Kafka 0.10 ] Possible bug

2017-03-22 Thread Afshartous, Nick
Hi, I think I'm seeing a bug in the context of upgrading to using the Kafka 0.10 streaming API. Code fragments follow. -- Nick JavaInputDStream> rawStream = getDirectKafkaStream(); JavaDStream> messagesTuple =

Re: [Spark Streaming+Kafka][How-to]

2017-03-22 Thread Cody Koeninger
Glad you got it worked out. That's cool as long as your use case doesn't actually require e.g. partition 0 to always be scheduled to the same executor across different batches. On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami wrote: > So it worked quite well with a

Re: Local spark context on an executor

2017-03-22 Thread Shashank Mandil
Sqoop doesn't work on a sharded database. Thanks, Shashank On Wed, Mar 22, 2017 at 5:43 AM Reynier González Tejeda wrote: > Why are you using Spark instead of Sqoop? > > 2017-03-21 21:29 GMT-03:00 ayan guha : > > For JDBC to work, you can start

Re: Local spark context on an executor

2017-03-22 Thread Reynier González Tejeda
Why are you using Spark instead of Sqoop? 2017-03-21 21:29 GMT-03:00 ayan guha : > For JDBC to work, you can start spark-submit with the appropriate JDBC driver > jars (using --jars); then you will have the driver available on the executors. > > For acquiring connections, create a
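A hedged sketch of the quoted suggestion (the driver jar, JDBC URL, table, and credentials are all assumptions): ship the driver with --jars so it is available on the executors, then read over JDBC:

    // spark-submit --jars /path/to/postgresql-42.x.x.jar ...   (assumed driver jar)
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // assumed shard URL
      .option("dbtable", "public.users")                     // assumed table
      .option("user", "spark")
      .option("password", "secret")
      .load()
    df.show()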

Re: kafka and spark integration

2017-03-22 Thread Didac Gil
Spark can be both a consumer and a producer from the Kafka point of view. You can create a Kafka client in Spark that subscribes to a topic and reads the feed, and you can process data in Spark and create a producer that sends that data to a topic. So, Spark sits next to Kafka, and you can use
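A hedged sketch of the consumer side with the spark-streaming-kafka-0-10 integration (broker, group id, and topic are assumptions):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-consumer")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",            // assumed broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest"
    )

    // Spark as a Kafka consumer: subscribe to a topic and read its records
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("input-topic"), kafkaParams)
    )

    stream.map(record => record.value).print()
    ssc.start()
    ssc.awaitTermination()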

kafka and spark integration

2017-03-22 Thread Adaryl Wakefield
I'm a little confused about how to use Kafka and Spark together. Where exactly does Spark sit in the architecture? Does it sit on the other side of the Kafka producer? Does it feed the consumer? Does it pull from the consumer? Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC