>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-19031
>>>>>>>>>>
>>>>>>>>>> In the meantime you could try implementing your own Source, but
>>>>>>>>>> that is pretty low level and is not yet a stable API.
> On Fri, Aug 19, 2016 at 2:33 PM, Davies Liu <dav...@databricks.com> wrote:
>> You are using lots of tiny executors (128 executor with only 2G
>> memory), could you try with bigger executor (for example 16G x 16)?
>>
>> On Fri, Aug 19, 2016 at 8:19 AM, Ben Teeuwen <bteeu...@gmail.com> wrote:
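For reference, a rough sketch of what the bigger-executor suggestion could look like as YARN launch flags (the core count is illustrative, not part of the original advice):

/opt/spark/2.0.0/bin/spark-shell \
  --num-executors 16 \
  --executor-memory 16G \
  --executor-cores 4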
>
> By the way, the feature size you select for the hasher should be a power of 2
> (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes are
> evenly distributed (see the section on HashingTF under
> http://spark.apache.org/docs/latest/ml-features.html#tf-idf).
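A minimal sketch of setting a power-of-2 feature size on the ML HashingTF (the column names and df are assumptions, borrowed from the example further down the thread):

from pyspark.ml.feature import HashingTF

# 2**24 buckets, a power of 2 so the hashed indexes spread evenly
hasher = HashingTF(inputCol="features", outputCol="features_vector",
                   numFeatures=2 ** 24)
hashed = hasher.transform(df)   # df['features'] assumed to hold arrays of "col=value" strings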
> On Aug 11, 2016, at 9:48 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi Ben,
>
> and that will take care of skewed data?
>
> Gourav
>
> On Thu, Aug 11, 2016 at 8:41 PM, Ben Teeuwen <bteeu...@gmail.com> wrote:
> N-dimensional weight vector that will by definition have 0s for unseen
> indexes. At test time, any feature that only appears in your test set or new
> data will be hashed to an index in the weight vector that has value 0.
>
> So it could be useful for both of your problems.
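A toy illustration of that point, outside Spark (the hash function, sizes and feature names are made up for the example):

import zlib

n_features = 2 ** 20                    # hash space size, illustrative only
w = [0.0] * n_features                  # learned weights; unseen indexes stay 0

def hash_index(feature):
    return zlib.crc32(feature.encode()) % n_features

# training only ever saw 'colour=red', so only that index carries weight
w[hash_index('colour=red')] = 0.7

# a value never seen in training hashes to an index whose weight is still 0,
# so it contributes nothing to the dot product at test time
print(w[hash_index('colour=chartreuse')])   # 0.0, barring an unlikely collision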
When you read both 'a' and 'b', can you try repartitioning both by column 'id'?
If you then .cache() and .count() to force a shuffle, it will push the records that
will be joined to the same executors.
So:
a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.parquet('path_to_table_b').repartition('id').cache()
b.count()
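The join itself might then look like this (a sketch, assuming 'id' is the key you join on):

joined = a.join(b, on='id')
joined.count()

Since both sides were already repartitioned on 'id', the records that match should already sit on the same executors when the join runs.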
So I'm curious what to use
instead.
> On Aug 4, 2016, at 3:54 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
> Have you looked at pyspark.sql.functions.udf and the associated examples?
> On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen <bteeu...@gmail.com> wrote:
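A minimal sketch of the route Nicholas points at, using pyspark.sql.functions.udf as a column expression (the column name and df are taken from the question below; the rest is illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

squareIt = udf(lambda x: x * x, LongType())           # declare the return type explicitly
df.withColumn('squared', squareIt(df['adgroupid'])).show()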
putCol="features",
> outputCol="features_vector")
>
> In [23]: hashed = hasher.transform(df)
>
> In [24]: hashed.show()
> +---+-------+-------------+----------------+
> | id|feature|     features| features_vector|
> +---+-------+-------------+----------------+
> |  0|    foo|[feature=foo]|
Hi,
I’d like to use a UDF in pyspark 2.0. As in ..
def squareIt(x):
    return x * x
# register the function and define return type
….
spark.sql("""select myUdf(adgroupid, 'extra_string_parameter') as
function_result from df""")
How can I register the function? I only see
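A minimal sketch of one way to register it for SQL in Spark 2.0 (assuming df is the DataFrame from the question; the extra string parameter would simply become a second Python argument):

from pyspark.sql.types import LongType

def squareIt(x):
    return x * x

# register the Python function under a SQL-visible name, declaring the return type
spark.udf.register('myUdf', squareIt, LongType())

# the SQL needs a temp view to select from
df.createOrReplaceTempView('df')
spark.sql('select myUdf(adgroupid) as function_result from df').show()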
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.sca
Related question: what are good profiling tools other than following along in the
application master UI while the code runs?
Are there things that can be logged during the run? If I have, say, 2 ways of
accomplishing the same thing, and I want to learn about the time/memory/general
resource blocking
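One low-tech option is to wrap each variant in a timed action and log the wall-clock numbers yourself; per-stage memory and shuffle metrics then come from the Spark UI, and can be kept after the run by setting spark.eventLog.enabled=true for the history server. A rough sketch, with two hypothetical variants of the same aggregation:

import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('timing')

def timed(label, fn):
    # fn() should end in an action (e.g. count) so the plan is actually executed
    start = time.time()
    result = fn()
    log.info('%s took %.1f s', label, time.time() - start)
    return result

timed('variant A', lambda: df.groupBy('id').count().count())
timed('variant B', lambda: df.repartition('id').groupBy('id').count().count())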
Hi,
I want to one hot encode a column containing 56 million distinct values. My
dataset is 800m rows + 17 columns.
I first apply a StringIndexer, but it already breaks there, giving an OOM java
heap space error.
I launch my app on YARN with:
/opt/spark/2.0.0/bin/spark-shell --executor-memory 10G
Hi all,
I’ve posted a question regarding sessionizing events using scala and
mapWithState at
http://stackoverflow.com/questions/38541958/materialize-mapwithstate-statesnapshots-to-database-for-later-resume-of-spark-st