streaming of binary files in PySpark

2017-05-22 Thread Yogesh Vyas
Hi, I want to use Spark Streaming to read binary files from HDFS. The documentation mentions using binaryRecordsStream(directory, recordLength), but I don't understand what the record length means. Does it mean the size of the binary file or something else? Regards,
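A minimal Scala sketch of the equivalent API (binaryRecordsStream exists on both the Scala StreamingContext and PySpark's ssc.binaryRecordsStream; the directory and record length below are illustrative): recordLength is the fixed size in bytes of each record inside the files, not the size of the files themselves.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BinaryRecordsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("binary-records-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each new file in the directory is split into 8-byte records;
    // every element of the DStream is one Array[Byte] of length 8.
    val records = ssc.binaryRecordsStream("hdfs:///data/incoming", recordLength = 8)
    records.map(_.length).print()

    ssc.start()
    ssc.awaitTermination()
  }
}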

Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-22 Thread kant kodali
Well, there are a few things here. 1. What is the Spark version? 2. You said there is an OOM error, but what cause appears in the log message or stack trace? OOM can happen for various reasons, and the JVM usually specifies the cause in the error message. 3. What is the driver and executor

Re: Bizarre UI Behavior after migration

2017-05-22 Thread Miles Crawford
Well, what's happening here is that jobs become "un-finished": they complete, and then later pop back into the "Active" section showing a small number of complete/in-progress tasks. In my screenshot, Job #1 completed as normal, and then later switched back to active with only 92 tasks... it

Re: Spark Streaming: Custom Receiver OOM consistently

2017-05-22 Thread Manish Malhotra
Thanks Alonso. Sorry, but there are some security reservations. However, we can assume the receiver is equivalent to a JMS-based custom receiver, where we register a message listener and each message delivered by JMS is stored by calling the store method from the listener. Something like:
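A minimal Scala sketch of that pattern (illustrative only; the MessageListener trait below stands in for a JMS MessageListener and is not part of any real API): the listener callback hands every delivered message to Spark via the receiver's store().

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical stand-in for a JMS MessageListener.
trait MessageListener { def onMessage(body: String): Unit }

class JmsLikeReceiver(subscribe: MessageListener => Unit)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  override def onStart(): Unit = {
    // Register the listener; each delivered message is pushed into Spark.
    subscribe(new MessageListener {
      override def onMessage(body: String): Unit = store(body)
    })
  }

  override def onStop(): Unit = {
    // Unsubscribe / close the connection here.
  }
}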

Re: Convert camelCase to snake_case when saving Dataframe/Dataset to parquet?

2017-05-22 Thread Mike Wheeler
Cool. Thanks a lot in advance. On Mon, May 22, 2017 at 2:12 PM, Bryan Jeffrey wrote: > Mike, > > I have code to do that. I'll share it tomorrow. > > Get Outlook for Android > > On Mon, May 22, 2017 at 4:53 PM -0400, "Mike Wheeler" <

Re: Convert camelCase to snake_case when saving Dataframe/Dataset to parquet?

2017-05-22 Thread Bryan Jeffrey
Mike, I have code to do that. I'll share it tomorrow. Get Outlook for Android On Mon, May 22, 2017 at 4:53 PM -0400, "Mike Wheeler" wrote: Hi Spark User, for Scala case classes we usually use camelCase (carType) for member fields. However,

Convert camelCase to snake_case when saving Dataframe/Dataset to parquet?

2017-05-22 Thread Mike Wheeler
Hi Spark User, for Scala case classes we usually use camelCase (carType) for member fields. However, many data systems use snake_case (car_type) for column names. When saving a Dataset of a case class to Parquet, is there any way to automatically convert camelCase to snake_case (carType -> car_type)?
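A minimal Scala sketch of one way to do this (not the code Bryan mentions; the helper name and regex here are just an illustration): rename every column from camelCase to snake_case right before the write, so the case class fields themselves stay camelCase.

import org.apache.spark.sql.DataFrame

object SnakeCaseColumns {
  // "carType" -> "car_type", "maxSpeedKmh" -> "max_speed_kmh"
  def camelToSnake(name: String): String =
    name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase

  def toSnakeCase(df: DataFrame): DataFrame =
    df.columns.foldLeft(df)((acc, col) => acc.withColumnRenamed(col, camelToSnake(col)))
}

// Usage sketch:
// case class Car(carType: String, maxSpeed: Int)
// val ds = spark.createDataset(Seq(Car("suv", 200)))
// SnakeCaseColumns.toSnakeCase(ds.toDF()).write.parquet("/tmp/cars")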

Re: Bizarre UI Behavior after migration

2017-05-22 Thread Vadim Semenov
I believe it shows only the tasks that have actually been executed; if there were tasks with no data, they don't get reported. I might be mistaken; if somebody has a good explanation, I would also like to hear it. On Fri, May 19, 2017 at 5:45 PM, Miles Crawford wrote: > Hey

Broadcasted Object is empty in executors.

2017-05-22 Thread Pedro Tuero
Hi, I'm using Spark 2.1.0 on AWS EMR with the Kryo serializer. I'm broadcasting a Java class: public class NameMatcher { private static final Logger LOG = LoggerFactory.getLogger(NameMatcher.class); private final Splitter splitter; private final SetMultimap itemsByWord;
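For comparison, a minimal Scala sketch of the usual broadcast pattern (a plain Map stands in for the Guava SetMultimap; this is not the original NameMatcher): the object is fully built on the driver before sc.broadcast, and executors only read the broadcast value. Populating the object after broadcasting, or relying on static/transient fields that Kryo does not serialize, can leave executors with an empty copy.

import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Build the lookup structure completely on the driver first.
    val itemsByWord: Map[String, Set[String]] =
      Map("red" -> Set("apple", "car"), "yellow" -> Set("banana"))

    val matcher = sc.broadcast(itemsByWord)

    // Executors only read the broadcast value; they never mutate it.
    val hits = sc.parallelize(Seq("red", "blue"))
      .map(word => word -> matcher.value.getOrElse(word, Set.empty[String]))
      .collect()

    hits.foreach(println)
    spark.stop()
  }
}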

Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread kant kodali
Hi Burak, my response is inline. Thanks a lot! On Mon, May 22, 2017 at 9:26 AM, Burak Yavuz wrote: > Hi Kant, >> 1. Can we use Spark Structured Streaming for stateless transformations >> just like we would do with DStreams, or is Spark Structured Streaming only >>

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-22 Thread Michael Armbrust
There is an RC here. Please test! http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html On Fri, May 19, 2017 at 4:07 PM, kant kodali wrote: > Hi Patrick, > > I am using 2.1.1 and I tried the above code you sent and I get > >

Re: Are tachyon and akka removed from 2.1.1 please

2017-05-22 Thread vincent gromakowski
Akka was replaced by Netty in 1.6. On 22 May 2017 at 15:25, "Chin Wei Low" wrote: > I think Akka has been removed since 2.0. > > On 22 May 2017 10:19 pm, "Gene Pang" wrote: > >> Hi, >> >> Tachyon has been renamed to Alluxio. Here is the

Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread Burak Yavuz
Hi Kant, > 1. Can we use Spark Structured Streaming for stateless transformations > just like we would do with DStreams, or is Spark Structured Streaming only > meant for stateful computations? Of course you can do stateless transformations. Any map, filter, or select type of transformation
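A minimal Scala sketch of a purely stateless Structured Streaming query (the socket source and console sink are just illustrative choices): only per-row projections and filters, no aggregations, so no state is kept between triggers.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{length, upper}

object StatelessQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stateless-sketch").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val cleaned = lines
      .select(upper($"value").as("value")) // stateless projection
      .filter(length($"value") > 3)        // stateless filter

    val query = cleaned.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}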

Re: Rmse recomender system

2017-05-22 Thread Chen, Mingrui
Hi, try using the most popular metrics (recall, precision, and F-score) to evaluate your recommendation system. Improving prediction performance depends on how good your features are and whether you choose a proper model. It's hard to tell without any more details.
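A minimal, self-contained Scala sketch of those metrics (toy sets, not tied to any particular model): compare the recommended items against the items the user actually interacted with.

object RecMetricsSketch {
  // precision = hits / recommended, recall = hits / relevant, F1 = harmonic mean
  def prf(recommended: Set[Int], relevant: Set[Int]): (Double, Double, Double) = {
    val hits = (recommended intersect relevant).size.toDouble
    val precision = if (recommended.nonEmpty) hits / recommended.size else 0.0
    val recall = if (relevant.nonEmpty) hits / relevant.size else 0.0
    val f1 =
      if (precision + recall > 0) 2 * precision * recall / (precision + recall) else 0.0
    (precision, recall, f1)
  }

  def main(args: Array[String]): Unit = {
    val (p, r, f) = prf(recommended = Set(1, 2, 3, 4, 5), relevant = Set(2, 5, 9))
    println(f"precision=$p%.2f recall=$r%.2f f1=$f%.2f") // precision=0.40 recall=0.67 f1=0.50
  }
}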

Re: KMeans Clustering is not Reproducible

2017-05-22 Thread Anastasios Zouzias
Hi Christoph, take a look at this; you might end up having a similar case: http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/ If this is not the case, then I agree with you that k-means should be partitioning-agnostic (although I haven't checked the code yet). Best,
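A minimal Scala sketch of the kind of issue the linked post is about (a toy example of my own, assuming the usual lazy-recomputation semantics): an RDD built with randomness is recomputed, possibly differently, on every action unless it is cached.

import org.apache.spark.sql.SparkSession
import scala.util.Random

object CacheForCorrectnessSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-correctness").getOrCreate()
    val sc = spark.sparkContext

    val noisy = sc.parallelize(1 to 1000).map(_ => Random.nextDouble())

    // Without cache(), each action recomputes the map and the sums can differ.
    println(noisy.sum())
    println(noisy.sum())

    // With cache(), the values are materialized once and later actions agree
    // (as long as no cached partitions are evicted and recomputed).
    val pinned = noisy.cache()
    pinned.count()
    println(pinned.sum())
    println(pinned.sum())

    spark.stop()
  }
}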

Re: Are tachyon and akka removed from 2.1.1 please

2017-05-22 Thread Chin Wei Low
I think Akka has been removed since 2.0. On 22 May 2017 10:19 pm, "Gene Pang" wrote: > Hi, > > Tachyon has been renamed to Alluxio. Here is the documentation for > running Alluxio with Spark. > > Hope

Re: Are tachyon and akka removed from 2.1.1 please

2017-05-22 Thread Gene Pang
Hi, Tachyon has been renamed to Alluxio. Here is the documentation for running Alluxio with Spark. Hope this helps, Gene On Sun, May 21, 2017 at 6:15 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > I read some paper about

KMeans Clustering is not Reproducible

2017-05-22 Thread Christoph Bruecke
Hi, I'm trying to figure out how to use KMeans in order to achieve reproducible results. I have found that running the same KMeans instance on the same data with different partitioning will produce different clusterings. Given that a simple KMeans run with a fixed seed returns different results on
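A minimal Scala sketch of that experiment (toy data and an assumed "features" column, Spark 2.x ML API): the same KMeans with a fixed seed, fit on the same rows under two different partitionings, so the resulting costs can be compared.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansReproSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kmeans-repro").getOrCreate()
    import spark.implicits._

    val data = (1 to 1000)
      .map(i => Tuple1(Vectors.dense((i % 7).toDouble, (i % 13).toDouble)))
      .toDF("features")

    val kmeans = new KMeans().setK(3).setSeed(42L)

    // Same data, same seed, different partitioning.
    val costA = kmeans.fit(data.repartition(2)).computeCost(data)
    val costB = kmeans.fit(data.repartition(16)).computeCost(data)

    println(s"cost with 2 partitions:  $costA")
    println(s"cost with 16 partitions: $costB") // may differ even with a fixed seed
    spark.stop()
  }
}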

Spark on Mesos failure, when launching a simple job

2017-05-22 Thread ved_kpl
I have been trying to learn Spark on Mesos, but spark-shell just keeps ignoring the offers. Here is my setup (all the components are in the same subnet): - 1 Mesos master on an EC2 instance (t2.micro), command: `mesos-master --work_dir=/tmp/abc --hostname=` - 2 Mesos agents (each with 4
