Re: Failed to connect to master ...

2017-03-09 Thread ??????????
If you want to debug the app against a remote cluster, you should submit the jar from the command line with the Java debug option enabled, and then attach the IDE to connect to the cluster. ---Original--- From: "Shixiong(Ryan) Zhu" Date: 2017/3/8 15:38:35 To: "Mina Aslani"; Cc:
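For reference, a minimal sketch of that approach (master URL, port, class and jar names are all hypothetical): submit with the standard JDWP agent flags, then attach the IDE's remote debugger to the driver.

    spark-submit \
      --master spark://master-host:7077 \
      --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
      --class com.example.MyApp myapp.jar

With suspend=y the driver JVM waits until a debugger attaches on port 5005; executors can be instrumented similarly through spark.executor.extraJavaOptions.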

ml package data types

2017-03-09 Thread jinhong lu
Hi, Is there any documentation for the ml package data types, just like the mllib package docs here: https://spark.apache.org/docs/latest/mllib-data-types.html Or is it the same for ml and mllib? Thanks, lujinhong
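As a minimal sketch (assuming Spark 2.x), the ml package carries its own vector types in pyspark.ml.linalg, separate from the mllib ones, and they are used inside DataFrame columns:

    from pyspark.ml.linalg import Vectors   # ml types, not pyspark.mllib.linalg

    df = spark.createDataFrame(
        [(1.0, Vectors.dense([0.5, 1.2])),
         (0.0, Vectors.sparse(2, [0], [3.0]))],
        ["label", "features"])
    df.printSchema()   # the features column uses the ml package's VectorUDT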

Re: Spark failing while persisting sorted columns.

2017-03-09 Thread Yong Zhang
My guess is that your executor has already crashed, due to OOM. You should check the executor log; it may give you more information. Yong From: Rohit Verma Sent: Thursday, March 9, 2017 4:41 AM To: user Subject: Spark failing while

PickleException when collecting DataFrame containing empty bytearray()

2017-03-09 Thread tot0
Hi All, I tried creating a DataFrame containing an empty bytearray, bytearray(b''). The DataFrame can be created, but upon collection the following exception occurs; repro steps are also at the top of the pyspark shell session. Bytearrays that aren't empty cause no issues. This doesn't seem right
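A minimal PySpark sketch of the repro described (the column name is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # a non-empty bytearray collects without problems
    spark.createDataFrame([(bytearray(b'ab'),)], ["b"]).collect()
    # an empty bytearray reportedly triggers the PickleException on collect()
    spark.createDataFrame([(bytearray(b''),)], ["b"]).collect()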

PickleException when collecting DataFrame containing empty bytearray()

2017-03-09 Thread tot0
Hi All, I tried creating a DataFrame containing an empty bytearray, bytearray(b''). The DataFrame can be created, but upon collection the following exception occurs; repro steps are also at the top of the pyspark shell session: pyspark gist

Distinct for Avro Key/Value PairRDD

2017-03-09 Thread Alex Sulimanov
Good day everyone! Have you tried to de-duplicate records based on Avro generated classes? These classes extend SpecificRecord, which has equals and hashCode implementations, but when I try to use .distinct on my PairRDD (both key and value are Avro classes), it eliminates records which

Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-09 Thread Jörn Franke
I find this question strange. There is no best tool for every use case. For example, both tools mentioned below are suitable for different purposes, sometimes also complementary. > On 9 Mar 2017, at 20:37, Gaurav1809 wrote: > > Hi All, Would you please let me know

Which streaming platform is best? Kafka or Spark Streaming?

2017-03-09 Thread Gaurav1809
Hi All, Would you please let me know which streaming platform is best, be it for server log processing, social media feeds or any such streaming data? I want to know the comparison between Kafka & Spark Streaming. -- View this message in context:

How does preprocessing fit into Spark MLlib pipeline

2017-03-09 Thread aATv
I want to start using PySpark MLlib pipelines, but I don't understand how/where preprocessing fits into the pipeline. My preprocessing steps are generally of the following form: 1) Load log files (from S3) and parse them into a Spark DataFrame with columns user_id, event_type, timestamp, etc. 2)
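One way such preprocessing can live inside a pipeline is as transformer stages; a minimal sketch using built-in transformers, assuming the columns from step 1 and a parsed DataFrame named logs_df:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import SQLTransformer, StringIndexer

    # row-level cleanup expressed as a SQL stage
    clean = SQLTransformer(
        statement="SELECT user_id, event_type, timestamp "
                  "FROM __THIS__ WHERE event_type IS NOT NULL")
    # turn the categorical event_type into a numeric index
    index = StringIndexer(inputCol="event_type", outputCol="event_type_idx")

    model = Pipeline(stages=[clean, index]).fit(logs_df)
    features = model.transform(logs_df)

Loading and parsing the raw S3 files usually stays outside the pipeline; the pipeline starts once the data is a DataFrame.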

Question on Spark's graph libraries

2017-03-09 Thread enzo
I am a bit confused by the current roadmap for graph and graph analytics in Apache Spark. I understand that we have had two libraries for some time (the following is my understanding - please amend as appropriate!): GraphX, part of the Spark project. This library is based on RDDs and it is only

Spark Jobs filling up the disk at SPARK_LOCAL_DIRS location

2017-03-09 Thread kant kodali
Hi All, My Spark Streaming jobs are filling up the disk within a short amount of time (< 10 mins). I have 10GB of disk space and it is getting full at the SPARK_LOCAL_DIRS location. In my case SPARK_LOCAL_DIRS is set to /usr/local/spark/temp. There are a lot of files like input-0-1489072623600
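As a minimal sketch only (the path is hypothetical), the scratch location can be pointed at a larger volume via spark.local.dir, although SPARK_LOCAL_DIRS set by the cluster manager takes precedence over it; this buys headroom but does not explain why the input-0-* receiver block files accumulate:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")  # hypothetical larger volume
             .getOrCreate())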

Re: spark-sql use case beginner question

2017-03-09 Thread Subhash Sriram
We have a similar use case. We use the DataFrame API to cache data out of Hive tables, and then run pretty complex scripts on them. You can register your Hive UDFs to be used within Spark SQL statements if you want. Something like this: sqlContext.sql("CREATE TEMPORARY FUNCTION as ''") If you
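A minimal sketch of that pattern with hypothetical function and class names (requires a Hive-enabled session):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    # register a Hive UDF class under a SQL function name (both names hypothetical)
    spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.udf.MyUpper'")
    spark.sql("SELECT my_upper(name) FROM some_table").show()   # hypothetical table/column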

Re: Apparent memory leak involving count

2017-03-09 Thread Sean Owen
The driver keeps metrics on everything that has executed. This is how it can display the history in the UI. It's normal for the bookkeeping to keep growing because it's recording every job. You can configure it to keep records about fewer jobs. But thousands of entries isn't exactly big. On Thu,
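A minimal sketch of trimming that bookkeeping (the values are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.ui.retainedJobs", "200")     # default 1000
             .config("spark.ui.retainedStages", "200")   # default 1000
             .getOrCreate())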

Re: Apparent memory leak involving count

2017-03-09 Thread Jörn Franke
You seem to always generate a new RDD instead of reusing the existing one, so it does not seem surprising that the memory need is growing. > On 9 Mar 2017, at 15:24, Facundo Domínguez wrote: > > Hello, > > Some heap profiling shows that memory grows under a TaskMetrics
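For illustration, a small PySpark sketch of the difference (input path and parser are hypothetical):

    parsed = sc.textFile("/data/events").map(parse_line)

    # rebuilding the RDD on every iteration creates a fresh lineage (and fresh driver bookkeeping) each time
    for _ in range(1000):
        sc.textFile("/data/events").map(parse_line).count()

    # reusing (and caching) one RDD avoids that
    parsed.cache()
    for _ in range(1000):
        parsed.count()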

Re: Apparent memory leak involving count

2017-03-09 Thread Facundo Domínguez
Hello, Some heap profiling shows that memory grows under a TaskMetrics class. Thousands of live hashmap entries are accumulated. Would it be possible to disable collection of metrics? I've been looking for settings to disable it but nothing relevant seems to come up. Thanks, Facundo On Wed, Mar

Re:

2017-03-09 Thread Marco Mistroni
Try to remove the Kafka code, as Kafka does not seem to be the issue here. Create a DS and save it to Cassandra and see what happens, even in the console. That should give you a starting point. Hth On 9 Mar 2017 3:07 am, "sathyanarayanan mudhaliyar" < sathyanarayananmudhali...@gmail.com> wrote:
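A minimal sketch of that isolated test, assuming the DataStax spark-cassandra-connector is on the classpath (keyspace and table names are hypothetical):

    df = spark.createDataFrame([(1994, "Forrest Gump")], ["year", "title"])
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="test_ks", table="movies_by_year")
       .mode("append")
       .save())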

[no subject]

2017-03-09 Thread sathyanarayanan mudhaliyar
I am using Spark Streaming for a basic streaming movie count program. So first I have mapped the year and movie name to a JavaPairRDD and I am using reduceByKey for counting the movies year wise. I am using Cassandra for output; the Spark Streaming application is not stopping and the
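For reference, the counting pattern looks roughly like this in PySpark Streaming (source, batch interval and record format are hypothetical); note that awaitTermination() blocks until ssc.stop() is called or an error occurs, which may be why the application appears to never stop:

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=10)
    lines = ssc.socketTextStream("localhost", 9999)          # hypothetical source
    pairs = lines.map(lambda rec: (rec.split(",")[0], 1))    # (year, 1), assuming CSV-like records
    counts = pairs.reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()   # blocks here until ssc.stop()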

RE: Huge partitioning job takes longer to close after all tasks finished

2017-03-09 Thread PSwain
Hi Swapnil, We are facing the same issue. Could you please let me know how you found that the partitions are getting merged? Thanks in advance!! From: Swapnil Shinde [mailto:swapnilushi...@gmail.com] Sent: Thursday, March 09, 2017 1:31 AM To: cht liu Cc:

Spark failing while persisting sorted columns.

2017-03-09 Thread Rohit Verma
Hi all, Please help me with the scenario below. I am writing the below query on a large dataset (rowCount=100,000,000): // there are other instances of the below job being submitted to spark in a multithreaded app. final Dataset df = spark.read().parquet(tablePath); // df storage in hdfs is 5.64
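A rough PySpark sketch of the described job (paths and column names are hypothetical); if a single global ordering is not strictly required, sortWithinPartitions avoids the range-partitioning shuffle that a full sort performs before the write:

    df = spark.read.parquet("/path/to/table")

    # global sort, then persist (shuffles to produce a total ordering)
    df.sort("some_col").write.mode("overwrite").parquet("/path/to/sorted")

    # alternative: sort only within each partition, no global shuffle
    df.sortWithinPartitions("some_col").write.mode("overwrite").parquet("/path/to/partition_sorted")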

Re: Huge partitioning job takes longer to close after all tasks finished

2017-03-09 Thread Gourav Sengupta
Hi, you are definitely not using Spark 2.1 the way it should be used. Try using sessions and follow the guidelines; this issue was specifically resolved as part of the Spark 2.1 release. Regards, Gourav On Wed, Mar 8, 2017 at 8:00 PM, Swapnil Shinde wrote:
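A minimal sketch of the session-based entry point being referred to (app name and path are hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partitioned-write")
             .enableHiveSupport()            # only needed if Hive tables are involved
             .getOrCreate())
    df = spark.read.parquet("/path/to/input")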