Spark Streaming: Doing operation in Receiver vs RDD

2015-10-07 Thread emiretsk
Hi, I have a Spark Streaming program that is consuming messages from Kafka and has to decrypt and deserialize each message. I can implement it either as a Kafka deserializer (which will run in a receiver, or in the new receiver-less Kafka consumer) or as RDD operations. What are the pros/cons of each?
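
A minimal sketch of the RDD-operation approach with the receiver-less (direct) API, for reference; ssc, kafkaParams, and topics are assumed to be in scope, and makeCipher/decryptAndDeserialize are hypothetical helpers:

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Consume raw bytes from Kafka, then decrypt/deserialize as an RDD operation.
    val raw = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, kafkaParams, topics)

    val messages = raw.mapPartitions { iter =>
      val cipher = makeCipher()  // hypothetical: init crypto state once per partition
      iter.map { case (_, bytes) => decryptAndDeserialize(cipher, bytes) }
    }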

RE: Parquet file size

2015-10-07 Thread Younes Naguib
The original TSV files total 600GB and generated 40k Parquet files of 15-25MB each. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: October-07-15 3:18 PM To: Younes Naguib; 'user@spark.apache.org' Subject: Re: Parquet file size Why do you want larger files? Doesn't the result Parquet file contain all

Re: Parquet file size

2015-10-07 Thread Cheng Lian
The reason why so many small files are generated is probably the fact that you are inserting into a partitioned table with three partition columns. If you want larger Parquet files, you may try either avoiding the partitioned table, or using fewer partition columns (e.g., only year,
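
For illustration, a sketch of controlling file count from the DataFrame side (column and path names are made up):

    // Fewer partition columns plus coalesce => fewer, larger Parquet files.
    val df = sqlContext.sql("SELECT * FROM tbl_tsv")
    df.coalesce(8)
      .write
      .partitionBy("year", "month")
      .parquet("/warehouse/tbl")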

What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread YiZhi Liu
Hi everyone, I'm curious about the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS. Both of them are optimized using LBFGS; the only difference I see is that LogisticRegression takes a DataFrame while LogisticRegressionWithLBFGS takes an RDD. So
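
For reference, the two APIs side by side (trainingDF and trainingRDD are assumed inputs):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    // spark.ml: DataFrame-based Pipelines API
    val lr = new LogisticRegression().setMaxIter(100)
    val model = lr.fit(trainingDF)            // DataFrame with "label" and "features"

    // spark.mllib: RDD-based API
    val mllibModel = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(trainingRDD)                       // RDD[LabeledPoint]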

Model exports PMML (Random Forest)

2015-10-07 Thread Yasemin Kaya
Hi, I want to export my model to PMML, but there is no support for random forest yet; it is planned for version 1.6. Is it possible to produce the PMML XML format for my (random forest) model manually? Thanks. Best, yasemin -- hiç ender hiç

Re: does KafkaCluster can be public ?

2015-10-07 Thread Erwan ALLAIN
Thanks guys ! On Wed, Oct 7, 2015 at 1:41 AM, Cody Koeninger wrote: > Sure no prob. > > On Tue, Oct 6, 2015 at 6:35 PM, Tathagata Das wrote: > >> Given the interest, I am also inclining towards making it a public >> developer API. Maybe even

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread Sean Owen
These are true, but it's not because Spark is written in Scala; it's because it executes in the JVM. So, Scala/Java-based apps have an advantage in that they don't have to serialize data back and forth to a Python process, which also brings a new set of things that can go wrong. Python is also

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-07 Thread Sun, Rui
Not sure "/C/DevTools/spark-1.5.1/bin/spark-submit.cmd" is a valid path? From: Hossein [mailto:fal...@gmail.com] Sent: Wednesday, October 7, 2015 12:46 AM To: Khandeshi, Ami Cc: Sun, Rui; akhandeshi; user@spark.apache.org Subject: Re: SparkR Error in sparkR.init(master=“local”) in RStudio Have you

Re: Graceful shutdown drops processing in Spark Streaming

2015-10-07 Thread Michal Čizmazia
Thanks! Done. https://issues.apache.org/jira/browse/SPARK-10995 On 7 October 2015 at 21:24, Tathagata Das wrote: > Aaah, interesting, you are doing 15 minute slide duration. Yeah, > internally the streaming scheduler waits for the last "batch" interval > which has data to

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Michael Armbrust
-dev +user > 1) Is that the reason why it's always slow in the first run? Or are there > any other reasons? Apparently it loads data to memory every time, so it > shouldn't be something to do with disk reads, should it? You are probably seeing the effect of the JVM's JIT. The first run is
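
A quick way to observe the warm-up effect, as a sketch (query is an assumed SQL string):

    def time[T](body: => T): Long = {
      val t0 = System.nanoTime()
      body
      (System.nanoTime() - t0) / 1000000
    }

    // The first run is typically slower while the JVM JIT warms up.
    (1 to 3).foreach { i =>
      println(s"run $i: ${time(sqlContext.sql(query).collect())} ms")
    }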

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Please find attached. On Wed, Oct 7, 2015 at 7:36 PM, Ted Yu wrote: > Hemant: > Can you post the code snippet to the mailing list - other people would be > interested. > > On Wed, Oct 7, 2015 at 5:50 AM, Hemant Bhanawat > wrote: > >> Will send you

Default size of a datatype in SparkSQL

2015-10-07 Thread vivek bhaskar
I want to understand the use of the default size for a given datatype. The following link mentions that it's for internal size estimation: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DataType.html This behavior is also reflected in the code, where the default value seems to be used
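
For reference, defaultSize is exposed on each DataType instance; it is only a size hint used for internal estimation (e.g., by the planner), not the physical storage size:

    import org.apache.spark.sql.types._

    println(IntegerType.defaultSize)  // 4 bytes per value
    println(LongType.defaultSize)     // 8 bytes per value
    println(StringType.defaultSize)   // a fixed per-value estimate, not the actual length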

Re: Parquet file size

2015-10-07 Thread Deng Ching-Mallete
Hi, In our case, we're using the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to increase the size of the RDD partitions when loading text files, so it would generate larger parquet files. We just set it in the Hadoop conf of the SparkContext. You need to be careful though
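
A sketch of that setup (the 256MB value is an arbitrary example):

    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // A larger minimum split size => fewer, larger input partitions,
    // and hence fewer, larger output Parquet files.
    sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MINSIZE, 256L * 1024 * 1024)
    val lines = sc.textFile("/data/input.tsv")  // hypothetical path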

Re: Graceful shutdown drops processing in Spark Streaming

2015-10-07 Thread Tathagata Das
Aaah, interesting, you are doing 15 minute slide duration. Yeah, internally the streaming scheduler waits for the last "batch" interval which has data to be processed, but if there is a sliding interval (i.e. 15 mins) that is higher than batch interval, then that might not be run. This is indeed a

Re: Asking about the trend of increasing latency, hbase spikes.

2015-10-07 Thread Ted Yu
This question should be directed to user@ Can you use a third-party site for the images? They didn't go through. On Wed, Oct 7, 2015 at 5:35 PM, UI-JIN LIM wrote: > Hi. This is Ui Jin, Lim in Korea, LG CNS > We had set up and are operating hbase 0.98.13 on our customer,

Running Spark in Yarn-client mode

2015-10-07 Thread Sushrut Ikhar
Hi, I am new to Spark and I have been trying to run Spark in yarn-client mode. I get this error in the YARN logs: Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend Also, I keep getting these warnings: WARN YarnScheduler: Initial job has not accepted

Is coalesce smart while merging partitions?

2015-10-07 Thread Cesar Flores
It is my understanding that the default behavior of the coalesce function, when the user reduces the number of partitions, is to only merge them without executing a shuffle. My question is: is this merging smart? For example, does Spark try to merge the small partitions first, or is the selection of partitions
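
For context, the two options, sketched:

    val rdd = sc.parallelize(1 to 1000000, 100)

    // coalesce: merges existing partitions without a shuffle; the grouping
    // is driven by locality, not by partition size.
    val merged = rdd.coalesce(10)

    // repartition: full shuffle, but produces evenly sized partitions.
    val balanced = rdd.repartition(10)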

RE: Parquet file size

2015-10-07 Thread Younes Naguib
Well, I only have data for 2015-08, so in the end, only 31 partitions. What I'm looking for is some reasonably sized partitions. In any case, just the idea of controlling the output Parquet files' size or number would be nice. Younes Naguib Streaming Division Triton Digital | 1440

Re: Running Spark in Yarn-client mode

2015-10-07 Thread Jean-Baptiste Onofré
Hi Sushrut, which packaging of Spark do you use? Do you have a working Yarn cluster (with at least one worker)? spark-hadoop-x? Regards JB On 10/08/2015 07:23 AM, Sushrut Ikhar wrote: Hi, I am new to Spark and I have been trying to run Spark in yarn-client mode. I get this error in yarn

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the result Parquet file contain all the data in the original TSV file? Cheng On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I’m reading a large tsv file, and creating parquet files using sparksql: insert overwrite table tbl partition(year, month,

Re: RDD of ImmutableList

2015-10-07 Thread Jakub Dubovsky
I did not realize that Scala's and Java's immutable collections use different APIs, which causes this. Thank you for the reminder. This makes some sense now... -- Original message -- From: Jonathan Coveney To: Jakub Dubovsky

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Oh, this is an internal class of our project and I had used it without realizing the source. Anyway, the idea is to wrap the InternalRow in a class that derives from Row. When you implement the functions of the trait 'Row', the type conversions from Row types to InternalRow types have to be done
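
A rough sketch of that idea against Spark 1.5 internals (InternalRow is not a public API, so this is illustrative only and subject to change between releases):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types._

    class WrappedInternalRow(row: InternalRow, schema: StructType) extends Row {
      override def length: Int = schema.length
      override def isNullAt(i: Int): Boolean = row.isNullAt(i)

      // Convert internal types (e.g. UTF8String) back to external Row types.
      override def get(i: Int): Any = schema(i).dataType match {
        case StringType  => row.getUTF8String(i).toString
        case IntegerType => row.getInt(i)
        case LongType    => row.getLong(i)
        case dt          => row.get(i, dt)
      }

      override def copy(): Row = new WrappedInternalRow(row.copy(), schema)
    }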

ClassCastException while reading data from HDFS through Spark

2015-10-07 Thread Vinoth Sankar
I'm just reading data from HDFS through Spark. It throws *java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.BytesWritable* at line no. 6. I never used LongWritable in my code; I have no idea how the data ended up in that format. Note: I'm not using

Re: spark multi tenancy

2015-10-07 Thread ayan guha
Can queues also be used to separate workloads? On 7 Oct 2015 20:34, "Steve Loughran" wrote: > On 7 Oct 2015, at 09:26, Dominik Fries wrote: >> Hello Folks, >> We want to deploy several spark projects and want to use a unique

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-07 Thread Gerard Maas
Thanks for the feedback. Cassandra does not seem to be the issue. The time for writing to Cassandra is of the same order of magnitude (see below). The code structure is roughly as follows: dstream.filter(pred).foreachRDD{rdd => val sparkT0 = currentTimeMs val metrics =
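
A hedged reconstruction of the truncated snippet, to make the measurement points explicit (computeMetrics is a hypothetical stand-in for the per-partition work):

    dstream.filter(pred).foreachRDD { rdd =>
      val sparkT0 = System.currentTimeMillis()
      val metrics = rdd.mapPartitions { iter =>
        val partitionT1 = System.currentTimeMillis()  // per-partition timestamp
        Iterator((partitionT1, computeMetrics(iter)))
      }.collect()
      val sparkT1 = System.currentTimeMillis()
      println(s"batch took ${sparkT1 - sparkT0} ms")
    }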

hiveContext sql number of tasks

2015-10-07 Thread patcharee
Hi, I run a SQL query on about 10,000 partitioned ORC files. Because of the partition schema, the files cannot be merged any longer (to reduce the total number). From the command hiveContext.sql(sqlText), 10K tasks were created to handle each file. Is it possible to use fewer tasks? How
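
One option, sketched (the 200 is arbitrary): coalesce the resulting DataFrame so downstream stages run with fewer tasks; the initial file scan still needs one task per file:

    val df = hiveContext.sql(sqlText).coalesce(200)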

Re: Notification on Spark Streaming job failure

2015-10-07 Thread Steve Loughran
On 7 Oct 2015, at 06:28, Krzysztof Zarzycki wrote: Hi Vikram, So you give up on using the yarn-cluster mode of launching Spark jobs, is that right? AFAIK, when using yarn-cluster mode, the launch process (spark-submit) monitors the job running on YARN,

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Ophir Cohen
Which jar does WrappedInternalRow come from? It seems that I can't find it. BTW, what I'm trying to do now is to create a Scala array from the fields and then create a Row out of that array. The problem is that I get type mismatches... On Wed, Oct 7, 2015 at 8:03 AM, Hemant Bhanawat

Re: ClassCastException while reading data from HDFS through Spark

2015-10-07 Thread UMESH CHAUDHARY
As per the exception, it looks like there is a mismatch between the actual sequence file's value type and the one provided in your code. Change BytesWritable to *LongWritable* and try the execution again. -Umesh On Wed, Oct 7, 2015 at 2:41 PM, Vinoth Sankar wrote: >
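
For reference, a sketch of reading a sequence file with explicit key/value classes (the path is hypothetical):

    import org.apache.hadoop.io.{BytesWritable, LongWritable}

    // The key/value classes must match what the sequence file actually
    // contains, otherwise a ClassCastException surfaces at runtime.
    val data = sc.sequenceFile("hdfs:///data/input",
      classOf[LongWritable], classOf[BytesWritable])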

Re: spark multi tenancy

2015-10-07 Thread Dominik Fries
Currently we try to execute pyspark from the user CLI, in the context of the project user, but get this error (the cluster is kerberized): [@edgenode1 ~]$ pyspark --master yarn --num-executors 5 --proxy-user Python 2.7.5 (default, Jun 24 2015, 00:41:19) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Ophir Cohen
Thanks! Can you check if you can provide an example of the conversion? On Wed, Oct 7, 2015 at 2:05 PM, Hemant Bhanawat wrote: > Oh, this is an internal class of our project and I had used it without > realizing the source. > > Anyway, the idea is to wrap the InternalRow in

Re: spark multi tenancy

2015-10-07 Thread Steve Loughran
> On 7 Oct 2015, at 09:26, Dominik Fries wrote: > > Hello Folks, > > We want to deploy several spark projects and want to use a unique project > user for each of them. Only the project user should start the spark > application and have the corresponding packages

Re: RDD of ImmutableList

2015-10-07 Thread Sean Owen
I think Java's immutable collections are fine with respect to kryo -- that's not the same as Guava. On Wed, Oct 7, 2015 at 11:56 AM, Jakub Dubovsky wrote: > I did not realize that Scala's and Java's immutable collections use > different APIs, which causes this.

What happens in the master or slave launch ?

2015-10-07 Thread Camelia Elena Ciolac
Hello, I have the following question about two scenarios: 1) in the first scenario (when I'm connected on the target node) the master starts successfully. Its log contains: Spark Command: /usr/opt/java/jdk1.7.0_07/jre/bin/java -cp

Re: question on make multiple external calls within each partition

2015-10-07 Thread Chen Song
Thanks TD and Ashish. On Mon, Oct 5, 2015 at 9:14 PM, Tathagata Das wrote: > You could create a threadpool on demand within the foreachPartitoin > function, then handoff the REST calls to that threadpool, get back the > futures and wait for them to finish. Should be pretty
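
A sketch of TD's suggestion (callRestService and the pool size are assumptions):

    import java.util.concurrent.Executors
    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}

    rdd.foreachPartition { records =>
      // Thread pool created on demand inside the partition.
      val pool = Executors.newFixedThreadPool(10)
      implicit val ec = ExecutionContext.fromExecutorService(pool)
      try {
        val futures = records.map(r => Future(callRestService(r))).toList
        futures.foreach(f => Await.result(f, 30.seconds))  // wait for all calls
      } finally {
        pool.shutdown()
      }
    }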

Re: Notification on Spark Streaming job failure

2015-10-07 Thread Adrian Tanase
We’re deploying using YARN in cluster mode, to take advantage of the automatic restart of long-running streaming apps. We’ve also done a POC on top of Mesos+Marathon; that’s always an option. For monitoring / alerting, we’re using a combination of: * Spark REST API queried from OpsView via

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Ted Yu
Hemant: Can you post the code snippet to the mailing list - other people would be interested. On Wed, Oct 7, 2015 at 5:50 AM, Hemant Bhanawat wrote: > Will send you the code on your email id. > > On Wed, Oct 7, 2015 at 4:37 PM, Ophir Cohen wrote: > >>

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Will send you the code on your email id. On Wed, Oct 7, 2015 at 4:37 PM, Ophir Cohen wrote: > Thanks! > Can you check if you can provide example of the conversion? > > > On Wed, Oct 7, 2015 at 2:05 PM, Hemant Bhanawat > wrote: > >> Oh, this is an

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-07 Thread Khandeshi, Ami
Tried multiple permutations of setting home… Still the same issue > Sys.setenv(SPARK_HOME="c:\\DevTools\\spark-1.5.1") > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths())) > library(SparkR) Attaching package: ‘SparkR’ The following objects are masked from ‘package:stats’:

spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Hi All, I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes of data). The RDD is partitioned into 2048 partitions which are more or less equal and entirely cached in RAM. I evaluated the performance on several cluster sizes, and am witnessing a non-linear (power)

This post has NOT been accepted by the mailing list yet.

2015-10-07 Thread akhandeshi
I seem to see this for many of my posts... does anyone have a solution? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/This-post-has-NOT-been-accepted-by-the-mailing-list-yet-tp24969.html Sent from the Apache Spark User List mailing list archive at

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-07 Thread Cody Koeninger
When you say that the largest difference is from metrics.collect, how are you measuring that? Wouldn't that be the difference between max(partitionT1) and sparkT1, not sparkT0 and sparkT1? As for further places to look, what's happening in the logs during that time? Are the number of messages

RE: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-07 Thread Goodall, Mark (UK)
I would like to say that I have also had this issue, in two situations: one using Accumulo to store information, and one running multiple streaming jobs within the same streaming context (e.g. multiple saves to HDFS). In my case the situation worsens when one of the jobs, which has a long

Graceful shutdown drops processing in Spark Streaming

2015-10-07 Thread Michal Čizmazia
After triggering the graceful shutdown on the following application, the application stops before the windowed stream reaches its slide duration. As a result, the data is not completely processed (i.e. saveToMyStorage is not called) before shutdown. According to the documentation, graceful
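
For reference, the graceful variant of the stop call (ssc being the StreamingContext):

    // Stops receiving new data and waits for already-received data
    // to be fully processed before shutting down.
    ssc.stop(stopSparkContext = true, stopGracefully = true)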

Re: Spark job workflow engine recommendations

2015-10-07 Thread Nick Pentreath
We're also using Azkaban for scheduling, and we simply use spark-submit via shell scripts. It works fine. The auto-retry feature with a large number of retries (like 100 or 1000, perhaps) should take care of long-running jobs with restarts on failure. We haven't used it for streaming yet

Re: Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Sean Owen
-dev Is r.getInt(ind) very large in some cases? I think there's not quite enough info here. On Wed, Oct 7, 2015 at 6:23 PM, wrote: > When running stand-alone cluster mode job, the process hangs up randomly > during a DataFrame flatMap or explode operation, in

RE: Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
It can be large, yes. But that still does not answer the question of why it works in a smaller environment, i.e. local[32], or in cluster mode when using SQLContext instead of HiveContext. The process, in general, is a RowNumber() HiveQL operation, which is why I need HiveContext. I have the

Re: How can I disable logging when running local[*]?

2015-10-07 Thread Alex Kozlov
Hmm, clearly the parameter is not passed to the program. This could be an activator issue. I wonder how you specify the other parameters, like driver memory, num cores, etc.? Just out of curiosity, can you run a program: import org.apache.spark.SparkConf val out=new
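
If the configuration route keeps failing, a programmatic fallback (set before the SparkContext is created) is a sketch like:

    import org.apache.log4j.{Level, Logger}

    // Silence Spark's and Akka's log4j output.
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)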

Re: spark performance non-linear response

2015-10-07 Thread Sean Owen
OK, the next question then is: if this is wall-clock time for the whole process, I wonder if you are just measuring the time taken by the longest single task. I'd expect the time taken by the longest straggler task to follow a distribution like this. That is, how balanced are the partitions?
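
A quick balance check, as a sketch:

    // Record count per partition; a large max/min skew points to stragglers.
    val counts = rdd.mapPartitions(it => Iterator(it.size)).collect()
    println(s"partitions=${counts.length} min=${counts.min} max=${counts.max}")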

DataFrame with bean class

2015-10-07 Thread VJ
I have a bean class defined as follows: class result { private String name; public result() { }; public String getname() {return name;} public void setname(String s) {name = s;} } I then define DataFrame x = SqlContext.createDataFrame(myrdd, result.class); x.show() When I run this job, I

Re: spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Additional missing relevant information: I'm running a transformation, there are no shuffles occurring, and at the end I'm performing a lookup of 4 partitions on the driver. On 10/7/15 11:26 AM, Yadid Ayzenberg wrote: Hi All, I'm using Spark 1.4.1 to analyze a largish data set (several

Re: spark performance non-linear response

2015-10-07 Thread Jonathan Coveney
I've noticed this as well and am curious if there is anything more people can say. My theory is that it is just communication overhead. If you only have a couple of gigabytes (a tiny dataset), then splitting that across 50 nodes means you'll have a ton of tiny partitions, all finishing very quickly,

Re: This post has NOT been accepted by the mailing list yet.

2015-10-07 Thread Richard Hillegas
Hi Akhandeshi, It may be that you are not seeing your own posts because you are sending from a gmail account. See for instance https://support.google.com/a/answer/1703601?hl=en Hope this helps, Rick Hillegas STSM, IBM Analytics, Platform - IBM USA akhandeshi wrote on

Re: Temp files are not removed when done (Mesos)

2015-10-07 Thread Iulian Dragoș
It is indeed a bug. I believe the shutdown procedure in #7820 only kicks in when the external shuffle service is enabled (a pre-requisite of dynamic allocation). As a workaround you can use dynamic allocation (you can set spark.dynamicAllocation.maxExecutors and
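
A sketch of the workaround's configuration (the executor cap is arbitrary):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")        // external shuffle service
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.maxExecutors", "16")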

Re: Temp files are not removed when done (Mesos)

2015-10-07 Thread Iulian Dragoș
https://issues.apache.org/jira/browse/SPARK-10975 On Wed, Oct 7, 2015 at 11:36 AM, Iulian Dragoș wrote: > It is indeed a bug. I believe the shutdown procedure in #7820 only kicks > in when the external shuffle service is enabled (a pre-requisite of dynamic >

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread sethah
Regarding features, the general workflow for the Spark community when adding new features is to first add them in Scala (since Spark is written in Scala). Once this is done, a Jira ticket will be created requesting that the feature be added to the Python API (example - SPARK-9773

Re: spark multi tenancy

2015-10-07 Thread Steve Loughran
On 7 Oct 2015, at 11:06, ayan guha wrote: Can queues also be used to separate workloads? Yes; that's standard practice. Different YARN queues can have different maximum memory & CPU, and you can even tag queues as "pre-emptible", so more
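
For illustration, routing an application to a specific queue (the queue name is made up):

    import org.apache.spark.SparkConf

    // Equivalent to spark-submit --queue projectA on YARN.
    val conf = new SparkConf().set("spark.yarn.queue", "projectA")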

Parquet file size

2015-10-07 Thread Younes Naguib
Hi, I'm reading a large TSV file and creating Parquet files using Spark SQL: insert overwrite table tbl partition(year, month, day) Select from tbl_tsv; This works nicely, but generates small Parquet files (15MB). I wanted to generate larger files; any idea how to address this? Thanks,
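
One untested idea: add a DISTRIBUTE BY on the partition columns so each partition's rows land on one reducer, yielding fewer, larger files per partition directory:

    hiveContext.sql(
      """INSERT OVERWRITE TABLE tbl PARTITION (year, month, day)
        |SELECT * FROM tbl_tsv
        |DISTRIBUTE BY year, month, day""".stripMargin)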

Re: compatibility issue with Jersey2

2015-10-07 Thread Marcelo Vanzin
Seems like you might be running into https://issues.apache.org/jira/browse/SPARK-10910. I've been busy with other things but plan to take a look at that one when I find time... right now I don't really have a solution, other than making sure your application's jars do not include those classes the

Optimal way to avoid processing null returns in Spark Scala

2015-10-07 Thread swetha
Hi, I have the following functions that I am using for my job in Scala. As you can see in the getSessionId function, I sometimes return null. If I return null, the only way I can avoid processing those records is by filtering out the null records. I wanted to avoid having another pass for filtering
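
One common pattern: return Option instead of null and use flatMap, so the Nones are dropped without a separate filter pass (Record and its fields are hypothetical):

    // Hypothetical record type for illustration.
    def getSessionId(r: Record): Option[String] =
      if (r.hasSession) Some(r.sessionId) else None

    val sessions = rdd.flatMap(r => getSessionId(r).map(id => (id, r)))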

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
Hi YiZhi Liu, The spark.ml classes are part of the higher-level "Pipelines" API, which works with DataFrames. When creating this API, we decided to separate it from the old API to avoid confusion. You can read more about it here: http://spark.apache.org/docs/latest/ml-guide.html For (3): We

Re: Spark job workflow engine recommendations

2015-10-07 Thread Vikram Kone
Hien, I saw this pull request, and from what I understand it is geared towards running Spark jobs over Hadoop. We are using Spark over Cassandra and are not sure if this new job type supports that. I haven't seen any documentation on how to use this Spark job plugin, so that I can test it

Re: Spark job workflow engine recommendations

2015-10-07 Thread Hien Luu
The Spark job type was added recently - see this pull request: https://github.com/azkaban/azkaban-plugins/pull/195. You can leverage the SLA feature to kill a job if it runs longer than expected. BTW, we just solved the scalability issue by supporting multiple executors. Within a week or two, the

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread Michael Armbrust
> At my company we use Avro heavily and it's not been fun when I've tried to > work with complex Avro schemas and Python. This may not be relevant to you > however... otherwise I found Python to be a great fit for Spark :) Have you tried using https://github.com/databricks/spark-avro ? It
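
For reference, a minimal read with spark-avro on Spark 1.x (the path is made up):

    // Complex Avro schemas map onto nested DataFrame structures.
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/data/events.avro")
    df.printSchema()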

Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
When running a stand-alone cluster mode job, the process hangs up randomly during a DataFrame flatMap or explode operation, in HiveContext: -->> df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r) This does not happen with SQLContext in the cluster, nor with Hive/SQL in local mode, where it
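
A small diagnostic sketch for the follow-up question about how large r.getInt(ind) gets (ind assumed in scope): check the expansion factor before running the full flatMap:

    // A very large repeat count would explain a huge flatMap expansion.
    val counts = df.map(_.getInt(ind))   // DataFrame.map returns an RDD in Spark 1.x
    println(s"max repeat count: ${counts.max()}")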

Re: compatibility issue with Jersey2

2015-10-07 Thread Gary Ogden
What you suggested seems to have worked for unit tests. But now it throws this at run time on mesos with spark-submit: Exception in thread "main" java.lang.LinkageError: loader constraint violation: when resolving method