Hi All,
I have data partitioned by year=yyyy/month=mm/day=dd. What is the best way
to get two months of data from a given year (let's say June and July)?
Two ways I can think of:
1. use unionAll
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)
It depends on how many partitions you have and whether you are only doing a single
operation. Loading all the data and filtering will require us to scan the
directories to discover all the months. This information will be cached.
Then we should prune and avoid reading unneeded data.
Option 1 does not
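For reference, here is a minimal Scala sketch of the load-and-filter approach (the base path is from the question; `sqlContext` is the usual SQLContext). Since month is a partition column, the filter should prune directories rather than scan them:

    // Read from the table root so Spark discovers the partition columns
    // (year, month, day), then filter on the partition columns. The filter
    // should be pushed down and prune the unneeded directories.
    val df = sqlContext.read.parquet("xxx")
      .filter("year = 2015 AND month IN (6, 7)")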
Hi Jorge,
Unfortunately, I couldn't transform the data as you suggested.
This is what I get:
+---+---------+-------------+
| id|pageIndex|      pageVec|
+---+---------+-------------+
|0.0|      3.0|    (3,[],[])|
|1.0|      0.0|(3,[0],[1.0])|
|2.0|      2.0|(3,[2],[1.0])|
|3.0|
Hi Ben,
> My company uses Lambda to do simple data moving and processing using Python
> scripts. I can see that using Spark instead for the data processing would make
> it into a real production-level platform.
That may be true. Spark has first-class support for Python, which
should make your life
Hi Xiangrui,
For the following problem, I found out an issue ticket you posted before
https://issues.apache.org/jira/browse/HADOOP-10614
I wonder if this has been fixed in Spark 1.5.2; I believe it has. Any
suggestions on how to fix it?
Thanks
Hao
From: Lin, Hao [mailto:hao@finra.org]
Hi Guys,
I need help with Spark memory errors when executing ML pipelines.
The error that I see is:
16/02/02 20:34:17 INFO Executor: Executor is trying to kill task
32.0 in stage 32.0 (TID 3298)
16/02/02 20:34:17 INFO Executor: Executor is trying to kill task
12.0 in stage 32.0 (TID 3278)
Can you share some code that produces the error? It is probably not
due to Spark but rather the way data is handled in the user code.
Does your code call any reduceByKey actions? These are often a source
of OOM errors.
On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov wrote:
I’m trying to create a DF for an external Hive table that is in HBase.
I get a NoSuchMethodError
For the memoryOverhead I have the default of 10% of 16g, and Spark version is
1.5.2.
Stefan Panayotov, PhD
Sent from Outlook Mail for Windows 10 phone
From: Ted Yu
Sent: Tuesday, February 2, 2016 4:52 PM
To: Jakob Odersky
Cc: Stefan Panayotov; user@spark.apache.org
Subject: Re: Spark 1.5.2
Querying a service or a database from a Spark job is in most cases an
anti-pattern, but there are exceptions. The jobs become unstable and
nondeterministic by relying on a live database.
The recommended pattern is to take regular dumps of the database to
your cluster storage, e.g. HDFS, and join
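A sketch of that pattern (paths and column names here are hypothetical, only to show the shape of it):

    // Read the periodic dump from HDFS instead of querying the live database.
    val users  = sqlContext.read.parquet("hdfs:///dumps/users/latest")
    val events = sqlContext.read.parquet("hdfs:///data/events")
    // Join against the static dump; the job is reproducible for a given dump.
    val joined = events.join(users, events("userId") === users("id"))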
What value do you use for spark.yarn.executor.memoryOverhead ?
Please see https://spark.apache.org/docs/latest/running-on-yarn.html for
description of the parameter.
Which Spark release are you using ?
Cheers
On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky wrote:
> Can you
Hi,
I've been struggling with an issue ever since I tried to upgrade my Spark
Streaming solution from 1.4.1 to 1.5+.
I have a Spark Streaming app which creates 3 ReceiverInputDStreams
leveraging KinesisUtils.createStream API.
I used to leverage a timeout to terminate my app
Hi Charles,
You may find slides 16-20 from this deck useful:
http://www.slideshare.net/mg007/big-data-trends-challenges-opportunities-57744483
I used it for a talk that I gave to MS students last week. I wanted to give
them some context before describing Spark.
It doesn’t cover all the stuff
Looks like this is related:
HIVE-12406
FYI
On Tue, Feb 2, 2016 at 1:40 PM, Doug Balog wrote:
> I’m trying to create a DF for an external Hive table that is in HBase.
> I get a NoSuchMethodError
>
Look at part #3 in the blog below:
http://www.openkb.info/2015/06/resource-allocation-configurations-for.html
You may want to increase the executor memory, not just the
spark.yarn.executor.memoryOverhead.
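For illustration, a sketch that raises both settings programmatically (the values are placeholders, not tuned recommendations; the same can be passed as --conf flags to spark-submit):

    // Executor heap plus memoryOverhead (in MB) must fit in the YARN container.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "16g")
      .set("spark.yarn.executor.memoryOverhead", "2048")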
On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov wrote:
> For the
Hello,
I find myself in need of being able to process a large number of files (28M)
stored in a deeply nested folder hierarchy (Pairtree... a multi-level
hashtable-on-disk-like structure). Here's an example path:
./udel/pairtree_root/31/74/11/11/56/89/39/3174568939/3174568939.zip
I
I think the Spark DataFrame supports more than just SQL. It is more like a
pandas dataframe. (I rarely use the SQL feature.)
There are a lot of novelties in DataFrame, so I think it is quite optimized
for many tasks. The in-memory data structure is very memory-efficient. I
just changed a very slow RDD
>
> A principal difference between RDDs and DataFrames/Datasets is that the
> latter have a schema associated with them. This means that they support only
> certain types (primitives, case classes and more) and that they are
> uniform, whereas RDDs can contain any serializable object and must not
>
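To illustrate the schema point, a small sketch (the case class is made up):

    case class Person(name: String, age: Int)

    // An RDD can hold any serializable object; Spark knows nothing about
    // its structure.
    val rdd = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))

    // Converting to a DataFrame derives a schema from the case class.
    import sqlContext.implicits._
    val df = rdd.toDF()
    df.printSchema()
    // root
    //  |-- name: string (nullable = true)
    //  |-- age: integer (nullable = false)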
When using ALS, is it possible to use recommendProductsForUser for a subset of
users?
Currently, productFeatures and userFeatures are val. Is there a workaround for
it? Using recommendForUser repeatedly would not work in my case, since it would
be too slow with many users.
Thank you,
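One possible workaround, sketched here and not verified at scale: MatrixFactorizationModel has a public constructor, so you can build a second model that shares the product factors but keeps only the users you care about, then call the batch recommendProductsForUsers on it (`model` is the trained ALS model; `userSubset` is a hypothetical set of user ids):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    val userSubset = Set(1, 2, 3) // hypothetical user ids
    val subsetModel = new MatrixFactorizationModel(
      model.rank,
      model.userFeatures.filter { case (id, _) => userSubset.contains(id) },
      model.productFeatures)
    // Batch recommendations restricted to the subset of users.
    val recs = subsetModel.recommendProductsForUsers(10)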
Sure, having a common distributed query and compute engine for all kinds of
data sources is an alluring concept to market and advertise and to attract
potential customers (non-engineers, analysts, data scientists). But it's
nothing new!.. but darn old school. It's taking bits and pieces from
existing SQL
To address one specific question:
> The docs say it uses sun.misc.unsafe to convert the physical RDD structure
into a byte array at some point for optimized GC and memory. My question is
why is it only applicable to SQL/DataFrame and not RDD? RDD has types too!
A principal difference between RDDs and
Hi Michael,
Is there a section in the Spark documentation that demonstrates how to
serialize arbitrary objects in a DataFrame? The last time I did it, I used a
User Defined Type (copied from VectorUDT).
Best Regards,
Jerry
On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust
i am seeing make-distribution fail because lib_managed does not exist. what
seems to happen is that the sql/hive module gets built and creates this
directory. but after this, sometime later, the spark-parent module gets built,
which includes:
[INFO] Building Spark Project Parent POM 1.6.0-SNAPSHOT
[INFO]
Hi,
I would like to know how to calculate how much --executor-memory we should
allocate, and how many --num-executors and --total-executor-cores we should
give while submitting Spark jobs.
Is there any formula for it?
Thanks,
Divya
well the "hadoop" way is to save to a/b and a/c and read from a/* :)
On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam wrote:
> Hi Spark users and developers,
>
> does anyone know how to union two RDDs without the overhead of it?
>
> say rdd1.union(rdd2).saveAsTextFile(..)
> This
i am surprised union introduces a stage. UnionRDD should have only narrow
dependencies.
On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers wrote:
> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>
> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam
Hello,
In our Spark streaming application, we are forming DStreams made of objects
of a rather large composite class. I have discovered that operations like
RDD.subtract() are only successful for complex objects such as these after
overriding the toString() and hashCode() methods
Hi Jerry,
Yes, I read that benchmark, and it doesn't help in most cases. I'll give you
an example from one of our applications. It's a memory hogger by nature since
it works on groupByKey and performs combinatorics on the Iterator, so it
maintains a few structures inside each task. It works on MapReduce with half the
Agree with Koert that UnionRDD should have narrow dependencies,
although a union of two RDDs increases the number of tasks to be executed
(rdd1.partitions + rdd2.partitions).
If your two RDDs have the same number of partitions, you can also use
zipPartitions, which results in fewer tasks,
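A sketch of the zipPartitions variant (assuming both RDDs hold the same element type; it fails at runtime if the partition counts differ):

    // One task per partition pair instead of rdd1.partitions + rdd2.partitions.
    val combined = rdd1.zipPartitions(rdd2) { (iter1, iter2) => iter1 ++ iter2 }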
with respect to joins, unfortunately not all implementations are available.
for example i would like to use joins where one side is streaming (and the
other cached). this seems to be available for DataFrame but not for RDD.
On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel
For your case, that's true.
But it's not always correct for a pipeline model; some transformers in the
pipeline, such as OneHotEncoder, will change the features.
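For example (a sketch with hypothetical column names), OneHotEncoder turns a single indexed column into a sparse vector column, so downstream stages see different features than the raw input:

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // "category" -> numeric index -> one-hot sparse vector
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")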
2016-02-03 1:21 GMT+08:00 jmvllt :
> Hi everyone,
>
> This may sound like a stupid question but I need to be sure of this
There is a chance that the log message may change in future releases.
Log snooping would be broken.
FYI
On Mon, Feb 1, 2016 at 9:55 PM, Takeshi Yamamuro
wrote:
> Hi,
>
> Currently, there is no way to check the size except for snooping INFO-logs
> in a driver;
>
> 16/02/02
Devesh,
The cbind-like operation is not supported by the Scala DataFrame API, so it is also
not supported in SparkR.
You may try to work around this using the approach in
http://stackoverflow.com/questions/32882529/how-to-zip-twoor-more-dataframe-in-spark
You could also submit a JIRA
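For reference, the gist of the approach from that link, sketched (it assumes both DataFrames have the same number of partitions and the same number of rows per partition, which RDD.zip requires):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType

    // Pair up rows positionally, then concatenate each pair into one row.
    val zippedRows = df1.rdd.zip(df2.rdd).map { case (r1, r2) => Row.merge(r1, r2) }
    val combined = sqlContext.createDataFrame(
      zippedRows, StructType(df1.schema.fields ++ df2.schema.fields))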
Hi All,
A Spark job stage with saveAsHadoopFile fails with ExecutorLostFailure
whenever the executor is run with more cores. The stage is not memory
intensive, and the executor has 20 GB of memory. For example:
with 6 executors of 6 cores each, ExecutorLostFailure happens;
with 10 executors of 2 cores each,
It's possible you could (ab)use updateStateByKey or mapWithState for this.
But honestly it's probably a lot more straightforward to just choose a
reasonable batch size that gets you a reasonable file size for most of your
keys, then use filecrush or something similar to deal with the hdfs small
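If you do go the updateStateByKey route, a very rough sketch (types are hypothetical, and ssc.checkpoint(...) must be set): carry records per key across batches instead of writing a small file every interval. Emitting and clearing the buffer once it is big enough is left out here:

    // pairs: DStream[(String, String)] of key -> record
    val buffered = pairs.updateStateByKey[Vector[String]] {
      (newRecords: Seq[String], buffer: Option[Vector[String]]) =>
        Some(buffer.getOrElse(Vector.empty) ++ newRecords)
    }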
Thanks David.
I am looking at extending the SparkSQL library with a custom
package, hence I was looking for more details on any specific classes to
extend or implement to redirect calls to my
module (when using .format).
If you have any info along these lines, do
Hi,
We run Spark Streaming on YARN, and the Streaming driver restarts very often.
I don't know what's the matter.
The exception is below:
16/02/01 18:55:14 ERROR scheduler.JobScheduler: Error running job streaming job
1454324113000 ms.0
org.apache.spark.SparkException: Job aborted due to stage
Hello everyone.
I have some trouble running word2vec and the libs…
Is it possible to use Spark MLlib as an embedded library (like mllib.jar +
spark-core.jar) inside a Tomcat application (it already has the Hadoop libs)? By
default it is huge: one jar contains all the dependencies, and after
Hi Robert,
I just use textFile. Here is the simple code:
val fs3File = sc.textFile("s3n://my bucket/myfolder/")
fs3File.count
do you suggest I should use sc.parallelize?
many thanks
From: Robert Collich [mailto:rcoll...@gmail.com]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Hi everyone,
This may sound like a stupid question but I need to be sure of this :
Given a dataframe composed of « n » features : f1, f2, …, fn
For each row of my dataframe, I create a labeled point :
val row_i = LabeledPoint(label, Vectors.dense(v1_i,v2_i,…, vn_i) )
where v1_i,v2_i,…, vn_i
bq. Failed to connect to master XXX:7077
Is the 'XXX' above the hostname for the new master ?
Thanks
On Tue, Feb 2, 2016 at 1:48 AM, Anthony Tang
wrote:
> Hi -
>
> I'm running Spark 1.5.2 in standalone mode with multiple masters using
> zookeeper for failover. The
Hi David,
My company uses Lambda to do simple data moving and processing using Python
scripts. I can see that using Spark instead for the data processing would make
it into a real production-level platform. Does this pave the way to replacing
the need for a pre-instantiated cluster in AWS or bought
Divya,
According to my recent Spark tuning experiences, optimal executor-memory
size not only depends on your workload characteristics (e.g. working set
size at each job stage) and input data size, but also depends on your total
available memory and memory requirements of other components like
Hi,
I have data sets like:
Dataset 1
HeaderCol1 HeadCol2 HeadCol3
dataset 1 dataset2 dataset 3
dataset 11 dataset13 dataset 13
dataset 21 dataset22 dataset 23
Dataset 2
HeadColumn1 HeadColumn2 HeadColumn3 HeadColumn4
Tag1 Dataset1
Tag2 Dataset1
So the latest optimizations done in the Spark 1.4 and 1.5 releases are mostly
from project Tungsten. The docs say it uses sun.misc.unsafe to convert the
physical RDD structure into a byte array at some point for optimized GC and
memory. My question is why is it only applicable to SQL/DataFrame and not RDD? RDD
Hi,
I read the release notes and a few slideshares on the latest optimizations
done in the Spark 1.4 and 1.5 releases. Part of these are optimizations from
project Tungsten. The docs say it uses sun.misc.unsafe to convert the physical
RDD structure into a byte array before shuffle for optimized GC and memory. My
Dataset will have access to some of the catalyst/tungsten optimizations
while also giving you Scala and types. However, it is currently
experimental and not yet as efficient as it could be.
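For a flavor of it, a minimal sketch against 1.6 (the case class and file are made up):

    case class Click(userId: Long, url: String)

    import sqlContext.implicits._
    // Encoder-backed like a DataFrame, but with typed access like an RDD.
    val ds = sqlContext.read.json("clicks.json").as[Click]
    val filtered = ds.filter(_.userId > 0) // a typed lambda, not a SQL expression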
On Feb 2, 2016 7:50 PM, "Nirav Patel" wrote:
> Sure, having a common distributed
While you can construct the SQL string dynamically in Scala/Java/Python, it
would be best to use the DataFrame API for creating dynamic SQL queries. See
http://spark.apache.org/docs/1.5.2/sql-programming-guide.html for details.
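For instance, a sketch of building the query dynamically with the DataFrame API rather than concatenating SQL strings (column names and values are made up):

    import org.apache.spark.sql.functions.col

    val filterColumn = "country" // chosen at runtime
    val filterValue  = "US"
    val result = df.filter(col(filterColumn) === filterValue)
      .groupBy(col("city"))
      .count()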
On Feb 2, 2016, at 6:49 PM, Divya Gehlot
Yes, it's the IP address/host.
Sent from Yahoo Mail for iPad
On Tuesday, February 2, 2016, 8:04 AM, Ted Yu
Hi Spark users and developers,
does anyone know how to union two RDDs without the overhead of it?
say rdd1.union(rdd2).saveAsTextFile(..)
This requires a stage to union the 2 RDDs before saveAsTextFile (2 stages).
Is there a way to skip the union step but have the contents of the two RDDs
saved to the
I don't understand why one thinks an RDD of case objects doesn't have
types (a schema). If Spark can convert an RDD to a DataFrame, that means it
understood the schema. So then, from that point, why does one have to use SQL
features to do further processing? If all Spark needs for optimizations is the
schema, then what
For Kafka direct stream, is there a way to set the time between successive
retries? From my testing, it looks like it is 200ms. Any way I can increase
the time?
Hi Nirav,
I'm sure you read this?
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
There is a benchmark in the article showing that DataFrame "can" outperform
the RDD implementation by 2x. Of course, benchmarks can be "made". But
from the
Hi,
Does Spark support dynamic SQL?
Would really appreciate the help if anyone could share some
references/examples.
Thanks,
Divya