You can simply save the join result in distributed form, for example as an
HDFS file, and then copy the HDFS file to a local file.
There is an alternative, memory-efficient way to collect distributed data back
to the driver other than collect(): toLocalIterator. The iterator will
consume only as much memory as the largest partition in the RDD.
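A minimal sketch of that approach (assuming a running SparkContext and a
DataFrame named df; the output path is illustrative):

  import java.io.PrintWriter

  // Pull rows to the driver one partition at a time instead of all at once;
  // driver memory use stays around the size of the largest partition.
  val writer = new PrintWriter("/tmp/joined.csv")
  try {
    df.rdd.toLocalIterator.foreach(row => writer.println(row.mkString(",")))
  } finally {
    writer.close()
  }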
Hi, consider converting the dataframe to an RDD and then using
rdd.toLocalIterator to collect data on the driver node.
On Fri, Jul 15, 2016 at 9:05 AM, Pedro Rodriguez
wrote:
> Out of curiosity, is there a way to pull all the data back to the driver
> to save without collect()?
Hi,
I need to develop a service that will recommend to a user other similar
users that he can connect to. For each user I have data about user
preferences for specific items in the form:
user, item, preference
1, 75, 0.89
2, 168, 0.478
2, 99, 0.321
3, 31, 0.012
So
Thank you for your inputs. Will test it out and share my findings
On Thursday, July 14, 2016, CosminC wrote:
> Didn't have the time to investigate much further, but the one thing that
> popped out is that partitioning was no longer working on 1.6.1. This would
> definitely explain the 2x performance loss.
Out of curiosity, is there a way to pull all the data back to the driver to
save without collect()? That is, stream the data in chunks back to the driver
so that the maximum memory used is comparable to a single node's data, but all
the data is saved on one node.
—
Pedro Rodriguez
PhD Student in
Is there a simpler way to check if a data frame is cached other than:
dataframe.registerTempTable("cachedOutput")
assert(hc.isCached("cachedOutput"), "The table was not cached")
Thanks!
--
Cesar Flores
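I'm not aware of a cleaner hook on 1.x, but for reference here is a sketch of
the catalog route that Spark 2.x adds (assuming a SparkSession named spark;
the view still needs a name):

  // Spark 2.x: register once, then ask the catalog by name.
  dataframe.createOrReplaceTempView("cachedOutput")
  assert(spark.catalog.isCached("cachedOutput"), "The table was not cached")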
Hi,
My understanding is that the maximum size of a broadcast is
Long.MAX_VALUE (and possibly a bit more in practice, since the data is
encoded to save space, esp. for Catalyst-driven Datasets).
Ad 2. Before the tasks can access the broadcast variable it has to be sent
across the network, which may be too
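For concreteness, the basic pattern under discussion (the lookup map and
input RDD are illustrative):

  // Ship one read-only copy of the lookup table to each executor.
  val lookup = Map("75" -> 0.89, "168" -> 0.478)  // illustrative small table
  val bcLookup = sc.broadcast(lookup)

  // Tasks read it via .value; the data is not re-sent for every task.
  val scored = ids.map(id => bcLookup.value.getOrElse(id, 0.0))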
Hi,
Please reconsider your wish, since it is going to move the entire
distributed dataset to the single machine of the driver and may lead
to an OOME. It is more prudent to save your result to HDFS or S3 or any
other distributed filesystem (one that is accessible by the driver and
executors).
If you insist...
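If it helps, a minimal sketch of the safer route (the path is illustrative):

  // Write in parallel from the executors to a distributed filesystem.
  df.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("hdfs:///tmp/result")

  // If a single local file is then needed, merge it outside Spark:
  //   hdfs dfs -getmerge /tmp/result /local/path/result.csv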
Hello,
I am using data frames to join two Cassandra tables.
Currently when I invoke save on the data frame as shown below, it saves the
join results on the executor nodes.
  joineddataframe.select(...)
    .write.format("com.databricks.spark.csv").option("header", "true").save()
I would like to persist the
Hi,
I have a Spark 1.6.2 streaming job with multiple output operations (jobs)
doing idempotent changes in different repositories.
The problem is that I want to somehow pass errors from one output operation
to another such that in the current output operation I only update
previously successful
Hello All,
I am in the middle of designing real-time data enhancement services using
Spark Streaming. As part of this, I have to look up some reference data while
processing the incoming stream.
I have the questions below:
1) What is the maximum size of a lookup table / variable that can be stored as
Hi everyone,
I am trying to filter my features based on the spark.mllib ChiSqSelector.
filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
model.transform(lp.features)))
However, when I do the above I get the error below. Is there any other
way to filter my data to avoid
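For reference, the equivalent flow in Scala against the spark.mllib API
(a sketch; numTopFeatures and the input RDD are illustrative):

  import org.apache.spark.mllib.feature.ChiSqSelector
  import org.apache.spark.mllib.regression.LabeledPoint

  // Fit the selector on labeled data (trainingData: RDD[LabeledPoint])...
  val selector = new ChiSqSelector(50)  // keep the top 50 features
  val model = selector.fit(trainingData)

  // ...then map each point to its filtered feature vector.
  val filtered = trainingData.map(lp =>
    LabeledPoint(lp.label, model.transform(lp.features)))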
I am witnessing really weird behavior when loading data from an RDBMS.
I tried a different approach for loading the data: I provided a partitioning
column to get partitioned parallelism:
val df_init = sqlContext.read.format("jdbc").options(
Map("url" -> Configuration.dbUrl,
Are you re-sharding your kinesis stream as well?
I had a similar problem and increasing the number of kinesis stream shards
solved it.
--
Daniel Santana
Senior Software Engineer
EveryMundo
25 SE 2nd Ave., Suite 900
Miami, FL 33131 USA
main:+1 (305) 375-0045
EveryMundo.com
Seconding what Pedro said in the second paragraph: issuing an HTTP request
per row would not scale.
On Thu, Jul 14, 2016 at 12:26 PM, Pedro Rodriguez
wrote:
> Hi Amit,
>
> Have you tried running a subset of the IDs locally on a single thread? It
> would be useful to
Hi Amit,
Have you tried running a subset of the IDs locally on a single thread? It
would be useful to benchmark your getProfile function for a subset of the
data, then estimate how long the full data set would take, and divide that by
the number of Spark executor cores. This should at least serve as a
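If the per-row HTTP call turns out to dominate, a sketch of the usual
mitigation, one client per partition (makeHttpClient and getProfile are
stand-ins for the poster's own code):

  val profiles = custIds.mapPartitions { ids =>
    // Build one client per partition and reuse it for every row,
    // instead of opening a connection per record.
    val client = makeHttpClient()          // stand-in: your client setup
    ids.map(id => getProfile(client, id))  // stand-in: your lookup call
  }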
Hi,
I am on the Spark 2.0 preview release. According to the Spark 2.0 docs, to
share a TempTable/View, I need to:
"to run the Thrift server in the old single-session mode, please set
option spark.sql.hive.thriftServer.singleSession to true."
Question: *When using HiveThriftServer2.startWithContext(),
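I can't confirm how that option interacts with startWithContext, but for
reference this is roughly how one would set it when starting the server
in-process (a sketch):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  val spark = SparkSession.builder()
    .config("spark.sql.hive.thriftServer.singleSession", "true")
    .enableHiveSupport()
    .getOrCreate()

  // Start the Thrift server against this session's SQLContext.
  HiveThriftServer2.startWithContext(spark.sqlContext)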
I meant to say that first we can sort the individual partitions and then
sort them again by merging, a sort of divide-and-conquer mechanism.
Does sortByKey take care of all this internally?
On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik wrote:
> Can we increase the sorting
Can we increase the sorting speed of an RDD by doing a secondary sort first?
On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik wrote:
> Okay. Can't I supply the same partitioner I used for
> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
>
> On 14-Jul-2016
Okay. Can't I supply the same partitioner I used for
"repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
On 14-Jul-2016 11:38 PM, "Koert Kuipers" wrote:
> repartitionAndSortWithinPartitions partitions the rdd and sorts within
> each partition. so each
Additional information: The batch duration in my app is 1 minute, from
Spark UI, for each batch, the difference between Output Op Duration and Job
Duration is big. E.g. Output Op Duration is 1min while Job Duration is 19s.
2016-07-14 10:49 GMT-07:00 Renxia Wang :
> Hi all,
repartitionAndSortWithinPartitions partitions the rdd and sorts within each
partition. so each partition is fully sorted, but the rdd is not sorted.
sortByKey is basically the same as repartitionAndSortWithinPartitions
except it uses a range partitioner so that the entire rdd is sorted.
however
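A small sketch of the difference (assuming an RDD[(K, V)] named pairs with an
ordered key type):

  import org.apache.spark.HashPartitioner

  // Sorted within each partition only; partition boundaries are hash-based,
  // so the RDD as a whole is not in key order.
  val perPartition =
    pairs.repartitionAndSortWithinPartitions(new HashPartitioner(8))

  // Range-partitioned and sorted within partitions: partition i holds keys
  // that precede those of partition i+1, so the RDD is totally sorted.
  val global = pairs.sortByKey()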
Hi Koert
I have already used "repartitionAndSortWithinPartitions" for secondary
sorting and it works fine. Just wanted to know whether it will sort the
entire RDD or not.
On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers wrote:
> repartitionAndSortWithinPartitions sorts by keys,
repartitionAndSortWithinPartitions sorts by keys, not values per key, so it is
not really a secondary sort by itself.
for secondary sort also check out:
https://github.com/tresata/spark-sorted
On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik wrote:
> Hi guys
>
> In my spark/scala code I
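For the secondary-sort pattern itself, the usual trick (which, as far as I
know, is what spark-sorted wraps up) is a composite key plus a partitioner
that looks only at the primary key; a sketch, assuming an RDD[(String, Int)]
named data:

  import org.apache.spark.Partitioner

  // Partition on the primary key only, so all pairs for a key land together.
  class KeyPartitioner(parts: Int) extends Partitioner {
    def numPartitions: Int = parts
    def getPartition(key: Any): Int = {
      val (k, _) = key.asInstanceOf[(String, Int)]
      ((k.hashCode % parts) + parts) % parts  // non-negative modulo
    }
  }

  // Sort on the composite (key, value); within each partition the values
  // for a key then come out in sorted order.
  val composite = data.map { case (k, v) => ((k, v), ()) }
  val sorted =
    composite.repartitionAndSortWithinPartitions(new KeyPartitioner(8))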
Hi all,
I am running a Spark Streaming application with Kinesis on EMR 4.7.1. The
application runs on YARN and uses client mode. There are 17 worker nodes
(c3.8xlarge) with 100 executors and 100 receivers. This setting works fine.
But when I increase the number of worker nodes to 50, and increase
Hi Talebzadeh,
sorry, I forgot to answer the last part of your question:
At the O/S level you should see many CoarseGrainedExecutorBackend processes
through jps, each corresponding to one executor. Are they doing anything?
There is one worker with one executor busy and the rest are almost idle:
PID USER PR
Hi Talebzadeh,
we are using 6 worker machines.
We are reading the data through sqlContext (DataFrame), as suggested in the
documentation, in preference to the JdbcRDD.
The prop just specifies the user name, password, and driver class.
Right after this data load we register it as a temp table
val
Hi guys
In my Spark/Scala code I am implementing a secondary sort. I wanted to know:
when I call the "repartitionAndSortWithinPartitions" method, will the whole
(entire) RDD be sorted, or only the individual partitions?
If it's the latter, will applying a "sortByKey" after
Hi Jakub,
Sounds like one executor. Can you point out:
1. The number of slaves/workers you are running
2. Are you using JDBC to read data in?
3. Do you register the DF as a temp table, and if so, have you cached the temp table
Sounds like only one executor is active and the rest are sitting
I have seen similar behavior in my standalone cluster. I tried to increase the
number of partitions, and at some point it seems all the executors or worker
nodes start to make parallel connections to the remote data store. But it would
be nice if someone could point us to some references on how to
Hi all,
What is the difference between JavaReceiverInputDStream and JavaDStream?
I see that the latter is always used in all custom receivers when
createStream is going to be used from Python.
Thanks,
Paolo.
Hello,
I have a Spark cluster running in standalone mode, a master + 6 executors.
My application reads data from a database via DataFrame.read; then
there is filtering of rows. After that I re-partition the data, and I wonder
why on the executors page of the driver UI I see RDD blocks all
Hi All,
I have a requirement to call a REST service URL for 300k customer IDs.
Things I have tried so far:
custid_rdd = sc.textFile('file:Users/zzz/CustomerID_SC/Inactive User Hashed
LCID List.csv') #getting all the customer ids and building adds
profile_rdd = custid_rdd.map(lambda r:
I have a dataframe with a few dimensions, for example:
I want to build a cube on i, j, k, and get a rank based on total per row (per
grouping), so that when doing:
df.filter('i === 3 && 'j === 1).show
I will get
so basically, for any grouping combination, I need a separate dense rank
list (i, j, k,
I'm posting again, as the tables are not showing up in the emails..
I have a dataframe with few dimensions, for example:
+---+---+---+-----+
|  i|  j|  k|total|
+---+---+---+-----+
|  3|  1|  1|    3|
|  3|  1|  2|    6|
|  3|  1|  3|    9|
|  3|  1|  4|   12|
|  3|  1|  5|   15|
|  3|  1|  6|
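One way to get a per-grouping dense rank (a sketch against a df shaped like
the above; on 1.x the window functions need a HiveContext):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // Rank rows by total within each (i, j) group; swap the partition
  // columns for whichever grouping combination is needed.
  val w = Window.partitionBy("i", "j").orderBy(desc("total"))
  val ranked = df.withColumn("rank", dense_rank().over(w))

  ranked.filter(col("i") === 3 && col("j") === 1).show()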
or would it be common practice to just retain the original categories in
another df?
Thanks Disha, that worked out well. Can you point me to an example of how to
decode the feature vectors in my dataframe back into their categories?
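If the categories were indexed with StringIndexer, IndexToString is the usual
way back (a sketch; the model and column names are stand-ins, and note it
decodes indexed columns rather than an assembled feature vector):

  import org.apache.spark.ml.feature.IndexToString

  // Map index values back to the original category strings, reusing the
  // labels learned by the StringIndexerModel.
  val converter = new IndexToString()
    .setInputCol("categoryIndex")
    .setOutputCol("originalCategory")
    .setLabels(indexerModel.labels)

  val decoded = converter.transform(indexedDf)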
Hi,
(first ever post)
I am experimenting with a Cloudera CDH5 cluster with Spark 1.5.0.
I have tried enabling the CsvSink metrics, which seems to work for local
Linux directories such as /tmp.
However, I'm getting errors when trying to send to an HDFS directory.
Is it possible to use HDFS?
Error message
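For reference, the CsvSink wiring in conf/metrics.properties usually looks
like the lines below; as far as I know the sink writes through the local
filesystem API, which would explain why an HDFS path fails (values are
illustrative):

  # conf/metrics.properties
  *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
  *.sink.csv.period=10
  *.sink.csv.unit=seconds
  *.sink.csv.directory=/tmp/spark-metrics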
Hello,
The variable argsList is an array defined above the parallel block. This
variable is accessed inside the map function. Launcher.main is not thread-safe.
Is it not possible to specify to Spark that every folder needs to be
processed as a separate process in a separate working directory?
Where is argsList defined? Is Launcher.main() thread-safe? Note that if
multiple folders are processed on a node, multiple threads may run concurrently
in the executor, each processing a folder.
> On Jul 14, 2016, at 12:28, Balachandar R.A. wrote:
>
> Hello Ted,
>
Didn't have the time to investigate much further, but the one thing that
popped out is that partitioning was no longer working on 1.6.1. This would
definitely explain the 2x performance loss.
Checking 1.5.1 Spark logs for the same application showed that our
partitioner was working correctly, and