Hi,
I have a Spark Streaming program that is consuming messages from Kafka and
has to decrypt and deserialize each message. I can implement it either as a
Kafka deserializer (that will run in a receiver or the new receiver-less
Kafka consumer) or as RDD operations. What are the pros/cons of each?
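For concreteness, a hedged sketch of the two placements (Spark 1.x plus Kafka
0.8 era APIs; ssc, kafkaParams and topicSet are assumed to exist, and
decrypt/deserialize/MyEvent are hypothetical stand-ins):

import kafka.serializer.{Decoder, DefaultDecoder}
import kafka.utils.VerifiableProperties
import org.apache.spark.streaming.kafka.KafkaUtils

case class MyEvent(payload: String)
def decrypt(bytes: Array[Byte]): Array[Byte] = bytes                      // stub
def deserialize(bytes: Array[Byte]): MyEvent = MyEvent(new String(bytes)) // stub

// Option 1: do the work in a Kafka Decoder, so records arrive already decoded.
class DecryptingDecoder(props: VerifiableProperties = null) extends Decoder[MyEvent] {
  override def fromBytes(bytes: Array[Byte]): MyEvent = deserialize(decrypt(bytes))
}

// Option 2: consume raw bytes and decode as a DStream operation, which keeps
// the work visible to Spark's scheduler, metrics and retry machinery.
val raw = KafkaUtils.createDirectStream[Array[Byte], Array[Byte],
  DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
val events = raw.map { case (_, bytes) => deserialize(decrypt(bytes)) }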
The original TSV file is 600GB and generated 40k files of 15-25MB.
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: October-07-15 3:18 PM
To: Younes Naguib; 'user@spark.apache.org'
Subject: Re: Parquet file size
Why do you want larger files? Doesn't the resulting Parquet file contain all
The reason why so many small files are generated is probably the fact
that you are inserting into a partitioned table with three
partition columns.
If you want larger Parquet files, you may try to either avoid using a
partitioned table, or use fewer partition columns (e.g., only year,
Hi everyone,
I'm curious about the difference between
ml.classification.LogisticRegression and
mllib.classification.LogisticRegressionWithLBFGS. Both are
optimized using L-BFGS; the only difference I see is that LogisticRegression
takes a DataFrame while LogisticRegressionWithLBFGS takes an RDD.
So
Hi,
I want to export my model to PMML, but there is no support for random
forests yet; it is planned for version 1.6. Is it possible to produce my
(random forest) model in PMML XML format manually? Thanks.
Best,
yasemin
--
hiç ender hiç
Thanks guys !
On Wed, Oct 7, 2015 at 1:41 AM, Cody Koeninger wrote:
> Sure no prob.
>
> On Tue, Oct 6, 2015 at 6:35 PM, Tathagata Das wrote:
>
> >> Given the interest, I am also leaning towards making it a public
>> developer API. Maybe even
These are true, but it's not because Spark is written in Scala; it's
because it executes in the JVM. So, Scala/Java-based apps have an
advantage in that they don't have to serialize data back and forth to
a Python process, which also brings a new set of things that can go
wrong. Python is also
Not sure "/C/DevTools/spark-1.5.1/bin/spark-submit.cmd" is valid?
From: Hossein [mailto:fal...@gmail.com]
Sent: Wednesday, October 7, 2015 12:46 AM
To: Khandeshi, Ami
Cc: Sun, Rui; akhandeshi; user@spark.apache.org
Subject: Re: SparkR Error in sparkR.init(master=“local”) in RStudio
Have you
Thanks! Done.
https://issues.apache.org/jira/browse/SPARK-10995
On 7 October 2015 at 21:24, Tathagata Das wrote:
> Aaah, interesting, you are using a 15-minute slide duration. Yeah,
> internally the streaming scheduler waits for the last "batch" interval
> which has data to
-dev +user
1). Is that the reason why it's always slow in the first run? Or are there
> any other reasons? Apparently it loads data to memory every time so it
> shouldn't be something to do with disk read should it?
>
You are probably seeing the effect of the JVM's JIT. The first run is
Please find attached.
On Wed, Oct 7, 2015 at 7:36 PM, Ted Yu wrote:
> Hemant:
> Can you post the code snippet to the mailing list - other people would be
> interested.
>
> On Wed, Oct 7, 2015 at 5:50 AM, Hemant Bhanawat
> wrote:
>
>> Will send you
I want to understand what the use of the default size for a given datatype
is. The following link mentions that it's for internal size estimation.
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DataType.html
The above behavior is also reflected in the code, where the default value
seems to be used
Hi,
In our case, we're using
the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to
increase the size of the RDD partitions when loading text files, so it
would generate larger parquet files. We just set it in the Hadoop conf of
the SparkContext. You need to be careful though
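A minimal sketch of that setting (the 256 MB value is arbitrary):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// Ask for input splits of at least 256 MB, so the text files load into
// fewer, larger RDD partitions and the Parquet output files come out larger.
sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MINSIZE, 256L * 1024 * 1024)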
Aaah, interesting, you are using a 15-minute slide duration. Yeah, internally
the streaming scheduler waits for the last "batch" interval which has data
to be processed, but if there is a sliding interval (i.e. 15 mins) that is
higher than the batch interval, then that might not be run. This is indeed a
This question should be directed to user@
Can you use a third-party site for the images? They didn't go through.
On Wed, Oct 7, 2015 at 5:35 PM, UI-JIN LIM wrote:
> Hi. This is Ui Jin, Lim in Korea, LG CNS
>
>
>
> We set up and are operating HBase 0.98.13 for our customer,
Hi,
I am new to Spark and I have been trying to run Spark in yarn-client mode.
I get this error in the YARN logs:
Error: Could not find or load main class
org.apache.spark.executor.CoarseGrainedExecutorBackend
Also, I keep getting these warnings:
WARN YarnScheduler: Initial job has not accepted
It is my understanding that the default behavior of the coalesce function,
when the user reduces the number of partitions, is to only merge them without
executing a shuffle.
My question is: is this merging smart? For example, does Spark try to merge
the small partitions first, or is the selection of partitions
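For reference, a minimal sketch of the two modes (rdd is assumed to exist):

val merged = rdd.coalesce(100)                  // no shuffle: existing partitions are merged
val even   = rdd.coalesce(100, shuffle = true)  // same as repartition(100): full shuffle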
Well, I only have data for 2015-08. So, in the end, only 31 partitions.
What I'm looking for is some reasonably sized partitions.
In any case, just the idea of controlling the output parquet files size or
number would be nice.
Younes Naguib Streaming Division
Triton Digital | 1440
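For what it's worth, a hedged sketch of bounding the output file count with
the DataFrame writer (Spark 1.4+ API; table and path names are hypothetical,
and 31 matches the one-partition-per-day idea above):

val tsv = hiveContext.table("tbl_tsv")
tsv.repartition(31)                      // bound the number of write tasks
  .write
  .partitionBy("year", "month", "day")
  .parquet("/warehouse/tbl")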
Hi Sushrut,
which packaging of Spark do you use?
Do you have a working YARN cluster (with at least one worker)?
spark-hadoop-x?
Regards
JB
On 10/08/2015 07:23 AM, Sushrut Ikhar wrote:
Hi,
I am new to Spark and I have been trying to run Spark in yarn-client mode.
I get this error in yarn
Why do you want larger files? Doesn't the resulting Parquet file contain
all the data in the original TSV file?
Cheng
On 10/7/15 11:07 AM, Younes Naguib wrote:
Hi,
I’m reading a large tsv file, and creating parquet files using sparksql:
insert overwrite
table tbl partition(year, month,
I did not realize that Scala's and Java's immutable collections use
different APIs, which causes this. Thank you for the reminder. This makes
some sense now...
-- Original message --
From: Jonathan Coveney
To: Jakub Dubovsky
Oh, this is an internal class of our project and I had used it without
realizing the source.
Anyway, the idea is to wrap the InternalRow in a class that derives from
Row. When you implement the functions of the trait Row, the type
conversions from Row types to InternalRow types have to be done
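A hedged sketch of that wrapping idea, written against Spark 1.5 internals
(which may change between releases); a real implementation would cover all
catalyst-to-external type conversions, not just strings:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType
import org.apache.spark.unsafe.types.UTF8String

class WrappedInternalRow(internal: InternalRow, schema: StructType) extends Row {
  override def length: Int = internal.numFields
  override def get(i: Int): Any =
    internal.get(i, schema(i).dataType) match {
      case s: UTF8String => s.toString   // catalyst string -> java.lang.String
      case other => other                // other types need similar handling
    }
  override def copy(): Row = Row.fromSeq((0 until length).map(get))
}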
I'm just reading data from HDFS through Spark. It throws
*java.lang.ClassCastException:
org.apache.hadoop.io.LongWritable cannot be cast to
org.apache.hadoop.io.BytesWritable* at line 6. I never used LongWritable
in my code and have no idea how the data ended up in that format.
Note : I'm not using
Can queues also be used to separate workloads?
On 7 Oct 2015 20:34, "Steve Loughran" wrote:
>
> > On 7 Oct 2015, at 09:26, Dominik Fries
> wrote:
> >
> > Hello Folks,
> >
> > We want to deploy several spark projects and want to use a unique
Thanks for the feedback.
Cassandra does not seem to be the issue. The time for writing to Cassandra
is in the same order of magnitude (see below).
The code structure is roughly as follows:
dstream.filter(pred).foreachRDD { rdd =>
  val sparkT0 = currentTimeMs
  val metrics =
Hi,
I run a SQL query on about 10,000 partitioned ORC files. Because of the
partition schema, the files cannot be merged any longer (to reduce the
total number).
From the command hiveContext.sql(sqlText), 10K tasks were created
to handle each file. Is it possible to use fewer tasks? How
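A hedged sketch of one partial workaround: the scan itself still launches one
task per file split, but coalescing the result right after the query shrinks
every downstream stage (200 is an arbitrary target):

val df = hiveContext.sql(sqlText).coalesce(200)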
On 7 Oct 2015, at 06:28, Krzysztof Zarzycki
> wrote:
Hi Vikram, so you give up using yarn-cluster mode for launching Spark jobs, is
that right? AFAIK, when using yarn-cluster mode, the launch process
(spark-submit) monitors the job running on YARN,
Which jar does WrappedInternalRow come from?
It seems that I can't find it.
BTW
What I'm trying to do now is to create a Scala array from the fields and then
create a Row out of that array.
The problem is that I get type mismatches...
On Wed, Oct 7, 2015 at 8:03 AM, Hemant Bhanawat
As per the exception, it looks like there is a mismatch between the actual
sequence file's value type and the one provided in your code.
Change BytesWritable to *LongWritable* and retry the execution.
-Umesh
On Wed, Oct 7, 2015 at 2:41 PM, Vinoth Sankar wrote:
>
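A hedged sketch of the suggested fix (the Text key type and the path are
assumptions; only the LongWritable value type is indicated by the exception):

import org.apache.hadoop.io.{LongWritable, Text}

val data = sc.sequenceFile("hdfs:///data/input", classOf[Text], classOf[LongWritable])
  .map { case (k, v) => (k.toString, v.get) }   // copy out of Hadoop's reused Writables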
Currently we try to execute pyspark from the user CLI, in the context of the
project user, but get this error (the cluster is kerberized):
[@edgenode1 ~]$ pyspark --master yarn --num-executors 5 --proxy-user
Python 2.7.5 (default, Jun 24 2015, 00:41:19)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on
Thanks!
Can you check if you can provide an example of the conversion?
On Wed, Oct 7, 2015 at 2:05 PM, Hemant Bhanawat
wrote:
> Oh, this is an internal class of our project and I had used it without
> realizing the source.
>
> Anyway, the idea is to wrap the InternalRow in
> On 7 Oct 2015, at 09:26, Dominik Fries wrote:
>
> Hello Folks,
>
> We want to deploy several spark projects and want to use a unique project
> user for each of them. Only the project user should start the spark
> application and have the corresponding packages
I think Java's immutable collections are fine with respect to Kryo --
that's not the same as Guava.
On Wed, Oct 7, 2015 at 11:56 AM, Jakub Dubovsky
wrote:
> I did not realize that Scala's and Java's immutable collections use
> different APIs, which causes this.
Hello,
I have the following question:
I have two scenarios:
1) in one scenario (if I'm connected on the target node) the master starts
successfully.
Its log contains:
Spark Command: /usr/opt/java/jdk1.7.0_07/jre/bin/java -cp
Thanks TD and Ashish.
On Mon, Oct 5, 2015 at 9:14 PM, Tathagata Das wrote:
> You could create a threadpool on demand within the foreachPartition
> function, then handoff the REST calls to that threadpool, get back the
> futures and wait for them to finish. Should be pretty
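A minimal sketch of that suggestion (dstream is assumed, restCall is a
hypothetical helper, and the pool size and timeout are arbitrary):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val pool = Executors.newFixedThreadPool(8)        // created on demand per partition
    implicit val ec = ExecutionContext.fromExecutor(pool)
    val futures = records.map(r => Future(restCall(r))).toList
    futures.foreach(f => Await.ready(f, 30.seconds))  // wait for all calls to finish
    pool.shutdown()
  }
}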
We’re deploying using YARN in cluster mode, to take advantage of automatic
restart of the long-running streaming app. We’ve also done a POC on top of
Mesos+Marathon; that’s always an option.
For monitoring / alerting, we’re using a combination of:
* Spark REST API queried from OpsView via
Hemant:
Can you post the code snippet to the mailing list - other people would be
interested.
On Wed, Oct 7, 2015 at 5:50 AM, Hemant Bhanawat
wrote:
> Will send you the code on your email id.
>
> On Wed, Oct 7, 2015 at 4:37 PM, Ophir Cohen wrote:
>
>>
Will send you the code on your email id.
On Wed, Oct 7, 2015 at 4:37 PM, Ophir Cohen wrote:
> Thanks!
> Can you check if you can provide an example of the conversion?
>
>
> On Wed, Oct 7, 2015 at 2:05 PM, Hemant Bhanawat
> wrote:
>
>> Oh, this is an
Tried multiple permutations of setting home… Still the same issue
> Sys.setenv(SPARK_HOME="c:\\DevTools\\spark-1.5.1")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths()))
> library(SparkR)
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
Hi All,
I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes
of data). The RDD is partitioned into 2048 partitions which are more or
less equal and entirely cached in RAM.
I evaluated the performance on several cluster sizes, and am witnessing
a non-linear (power)
I seem to see this for many of my posts... does anyone have a solution?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/This-post-has-NOT-been-accepted-by-the-mailing-list-yet-tp24969.html
Sent from the Apache Spark User List mailing list archive at
When you say that the largest difference is from metrics.collect, how are
you measuring that? Wouldn't that be the difference between
max(partitionT1) and sparkT1, not sparkT0 and sparkT1?
As for further places to look, what's happening in the logs during that
time? Are the number of messages
I would like to say that I have also had this issue,
in two situations: one using Accumulo to store information, and also when
running multiple streaming jobs within the same streaming context (e.g.
multiple saves to HDFS). In my case the situation worsens when one of the jobs,
which has a long
After triggering the graceful shutdown on the following application, the
application stops before the windowed stream reaches its slide duration. As
a result, the data is not completely processed (i.e. saveToMyStorage is not
called) before shutdown.
According to the documentation, graceful
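For context, the graceful stop being triggered is presumably the standard
call (ssc being the StreamingContext):

ssc.stop(stopSparkContext = true, stopGracefully = true)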
We're also using Azkaban for scheduling, and we simply use spark-submit via
shell scripts. It works fine.
The auto retry feature with a large number of retries (like 100 or 1000
perhaps) should take care of long-running jobs with restarts on failure. We
haven't used it for streaming yet
-dev
Is r.getInt(ind) very large in some cases? I think there's not quite
enough info here.
On Wed, Oct 7, 2015 at 6:23 PM, wrote:
> When running stand-alone cluster mode job, the process hangs up randomly
> during a DataFrame flatMap or explode operation, in
It can be large, yes. But that still does not resolve the question of why it
works in a smaller environment, i.e. local[32], or in cluster mode when using
SQLContext instead of HiveContext.
The process, in general, is a RowNumber() HiveQL operation; that is why I need
HiveContext.
I have the
Hmm, clearly the parameter is not passed to the program. This should be an
activator issue. I wonder how you specify the other parameters, like
driver memory, num cores, etc.? Just out of curiosity, can you run a
program:
import org.apache.spark.SparkConf
val out=new
OK, next question then is: if this is wall-clock time for the whole
process, then, I wonder if you are just measuring the time taken by the
longest single task. I'd expect the time taken by the longest straggler
task to follow a distribution like this. That is, how balanced are the
partitions?
I have a bean class defined as follows:
public class result {
  private String name;
  public result() { }
  public String getname() { return name; }
  public void setname(String s) { name = s; }
}
I then define
DataFrame x = sqlContext.createDataFrame(myrdd, result.class);
x.show()
When I run this job, I
Additional missing relevant information:
I'm running a transformation; there are no shuffles occurring, and at the
end I'm performing a lookup of 4 partitions on the driver.
On 10/7/15 11:26 AM, Yadid Ayzenberg wrote:
Hi All,
I'm using Spark 1.4.1 to analyze a largish data set (several
I've noticed this as well and am curious if there is anything more people
can say.
My theory is that it is just communication overhead. If you only have a
couple of gigabytes (a tiny dataset), then splitting that across 50 nodes
means you'll have a ton of tiny partitions all finishing very quickly,
Hi Akhandeshi,
It may be that you are not seeing your own posts because you are sending
from a gmail account. See for instance
https://support.google.com/a/answer/1703601?hl=en
Hope this helps,
Rick Hillegas
STSM, IBM Analytics, Platform - IBM USA
akhandeshi wrote on
It is indeed a bug. I believe the shutdown procedure in #7820 only kicks in
when the external shuffle service is enabled (a pre-requisite of dynamic
allocation). As a workaround you can use dynamic allocation (you can set
spark.dynamicAllocation.maxExecutors and
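For reference, a sketch of the relevant settings (the values are arbitrary;
dynamic allocation requires the external shuffle service):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.maxExecutors", "8")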
https://issues.apache.org/jira/browse/SPARK-10975
On Wed, Oct 7, 2015 at 11:36 AM, Iulian Dragoș
wrote:
> It is indeed a bug. I believe the shutdown procedure in #7820 only kicks
> in when the external shuffle service is enabled (a pre-requisite of dynamic
>
Regarding features, the general workflow for the Spark community when adding
new features is to first add them in Scala (since Spark is written in
Scala). Once this is done, a Jira ticket will be created requesting that the
feature be added to the Python API (example - SPARK-9773
On 7 Oct 2015, at 11:06, ayan guha
> wrote:
Can queues also be used to separate workloads?
yes; that's standard practice. Different YARN queues can have different maximum
memory & CPU, and you can even tag queues as "pre-emptible", so more
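For reference, a minimal sketch of routing an application to a specific queue
("project_a" is a hypothetical queue name):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.yarn.queue", "project_a")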
Hi,
I'm reading a large tsv file, and creating parquet files using sparksql:
insert overwrite
table tbl partition(year, month, day)
Select from tbl_tsv;
This works nicely, but generates small parquet files (15MB).
I wanted to generate larger files, any idea how to address this?
Thanks,
Seems like you might be running into
https://issues.apache.org/jira/browse/SPARK-10910. I've been busy with
other things but plan to take a look at that one when I find time...
right now I don't really have a solution, other than making sure your
application's jars do not include those classes the
Hi,
I have the following functions that I am using for my job in Scala. If you
look at the getSessionId function, I am returning null sometimes. If I return
null, the only way I can avoid processing those records is by filtering
out null records. I wanted to avoid having another pass for filtering
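A hedged sketch of one way around the extra pass: return Option instead of
null, and let flatMap extract and drop the empty cases in a single pass
(lines is assumed to be an RDD[String]; the parsing logic is hypothetical):

def getSessionId(line: String): Option[String] =
  line.split("\t").lift(3).filter(_.nonEmpty)   // e.g. the 4th TSV column, if present

val sessionIds = lines.flatMap(l => getSessionId(l))  // no separate filter(_ != null) step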
Hi YiZhi Liu,
The spark.ml classes are part of the higher-level "Pipelines" API, which
works with DataFrames. When creating this API, we decided to separate it
from the old API to avoid confusion. You can read more about it here:
http://spark.apache.org/docs/latest/ml-guide.html
For (3): We
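For reference, a minimal side-by-side sketch of the two APIs (Spark 1.5 era;
trainingDF and trainingRDD are assumed to exist):

import org.apache.spark.ml.classification.LogisticRegression              // DataFrame-based
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS  // RDD-based

val pipelineModel = new LogisticRegression().fit(trainingDF)  // DataFrame of (label, features)
val rddModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingRDD)                                           // RDD[LabeledPoint]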
Hien,
I saw this pull request, and from what I understand it is geared towards
running Spark jobs over Hadoop. We are using Spark over Cassandra and are not
sure if this new job type supports that. I haven't seen any documentation on
how to use this Spark job plugin, so that I can test it
The spark job type was added recently - see this pull request
https://github.com/azkaban/azkaban-plugins/pull/195. You can leverage the
SLA feature to kill a job if it ran longer than expected.
BTW, we just solved the scalability issue by supporting multiple
executors. Within a week or two, the
>
> At my company we use Avro heavily, and it's not been fun when I've tried to
> work with complex avro schemas and python. This may not be relevant to you
> however...otherwise I found Python to be a great fit for Spark :)
>
Have you tried using https://github.com/databricks/spark-avro ? It
When running a stand-alone cluster mode job, the process hangs up randomly during
a DataFrame flatMap or explode operation, in HiveContext:
-->> df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r)
This does not happen either with SQLContext in cluster, or Hive/SQL in local
mode, where it
What you suggested seems to have worked for unit tests. But now it throws
this at run time on mesos with spark-submit:
Exception in thread "main" java.lang.LinkageError: loader constraint
violation: when resolving method