Best Practice for Spark Job Jar Generation

2016-12-22 Thread Chetan Khatri
Hello Spark Community,

For Spark job creation I use sbt-assembly to build an uber ("fat") jar and
then submit it with spark-submit.

Example,

bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
/home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar

But other folks argue for a slimmer jar without bundled dependencies. Could you
please explain the industry-standard best practice for this?
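
For context, a minimal sbt-assembly sketch of this kind of build (library
names, versions, and settings below are only placeholders):

// build.sbt -- illustrative placeholder only
name := "SparkMSAPoc"
version := "1.0"
scalaVersion := "2.11.8"

// Spark itself is marked "provided" so it is not bundled into the assembly jar;
// spark-submit supplies it on the cluster at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
)

// project/plugins.sbt: addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}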

Thanks,

Chetan Khatri.


Re: java.lang.AssertionError: assertion failed

2016-12-22 Thread Liang-Chi Hsieh

Hi,

I think there is an issue in `ExternalAppendOnlyMap.forceSpill`, which is
called to release memory when another memory consumer asks for more memory
than is currently available.

I created a JIRA ticket and submitted a PR for it. Please see
https://issues.apache.org/jira/browse/SPARK-18986.



-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



Re: stratified sampling scales poorly

2016-12-22 Thread Liang-Chi Hsieh

Hi,

To quote the description of `sampleByKeyExact`:

"This method differs from [[sampleByKey]] in that we make additional passes
over the RDD to
create a sample size that's exactly equal to the sum of math.ceil(numItems *
samplingRate)
over all key values with a 99.99% confidence. When sampling without
replacement, we need one
additional pass over the RDD to guarantee sample size; when sampling with
replacement, we need
two additional passes."

As you can see, `sampleByKeyExact` needs additional passes over the RDD to make
sure it returns exactly the expected sample size.

If you don't need that guarantee, you can try `sampleByKey`, which also does
stratified sampling but without the strict requirement on the exactness of the
sample size.
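
For illustration, a rough sketch of the two calls on a pair RDD (the data,
fractions, and the `sc` SparkContext below are assumed for the example):

import org.apache.spark.rdd.RDD

// Hypothetical keyed data; `sc` is an existing SparkContext.
val data: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)))
val fractions = Map("a" -> 0.5, "b" -> 0.5)

// Single pass: per-key sample sizes are only approximately fraction * count.
val approx = data.sampleByKey(false, fractions)        // withReplacement = false

// Extra pass(es) over the RDD so each key's sample size matches the expectation exactly.
val exact = data.sampleByKeyExact(false, fractions)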



Martin Le wrote
> Hi all,
> 
> I perform sampling on a DStream by taking samples from RDDs in the
> DStream.
> I have used two sampling mechanisms: simple random sampling and stratified
> sampling.
> 
> Simple random sampling: inputStream.transform(x => x.sample(false,
> fraction)).
> 
> Stratified sampling: inputStream.transform(x => x.sampleByKeyExact(false,
> fractions))
> 
> where fractions = Map("key1" -> fraction, "key2" -> fraction, ..., "keyn" ->
> fraction).
> 
> My question is: why does stratified sampling scale poorly with different
> sampling fractions in this context, while simple random sampling scales well
> with different sampling fractions (I ran experiments on a 4-node cluster)?
> 
> Thank you,
> 
> Martin





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



Re: Aggregating over sorted data

2016-12-22 Thread Koert Kuipers
yes it's less optimal because an abstraction is missing and with
mapPartitions it is done without optimizations. but aggregator is not the
right abstraction to begin with: it assumes a monoid, which means no
ordering guarantees. you need a fold operation.
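
a rough sketch of what i mean (the case class and column names are made up,
just to illustrate an order-dependent fold over repartition +
sortWithinPartitions + mapPartitions):

import org.apache.spark.sql.SparkSession

// made-up example record, only to illustrate the pattern
case class Event(key: String, time: Long, value: Double)

val spark = SparkSession.builder.appName("fold-sketch").getOrCreate()
import spark.implicits._

val events = spark.createDataset(Seq(
  Event("a", 2L, 1.0), Event("a", 1L, 2.0), Event("b", 1L, 3.0)))

// all rows of a key land in one partition and arrive ordered by time; the fold
// below keeps the last value seen per key, which only makes sense because of the
// ordering -- exactly what a commutative/associative aggregator cannot express.
val lastValuePerKey = events
  .repartition($"key")
  .sortWithinPartitions($"key", $"time")
  .mapPartitions { iter =>
    iter.foldLeft(Map.empty[String, Double]) { (acc, e) => acc + (e.key -> e.value) }
      .iterator
  }
  .toDF("key", "lastValue")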

On Dec 22, 2016 02:20, "Liang-Chi Hsieh"  wrote:

>
> You can't use existing aggregation functions with that. Besides, the
> execution plan of `mapPartitions` doesn't support whole-stage codegen.
> Without that and some of the optimizations around aggregation, there might be
> a performance degradation. Also, when you have more than one key in a
> partition, you will need to take care of that in the function you apply to
> each partition.
>
>
> Koert Kuipers wrote
> > it can also be done with repartition + sortWithinPartitions +
> > mapPartitions.
> > perhaps not as convenient but it does not rely on undocumented behavior.
> > i used this approach in spark-sorted. see here:
> > https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala
> >
> > On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh <viirya@> wrote:
> >
> >>
> >> I agree that to make this work, you might need to know the Spark-internal
> >> implementation of APIs such as `groupBy`.
> >>
> >> But without any further changes to the current Spark implementation, I
> >> think this is one possible way to achieve the required behavior of
> >> aggregating over sorted data per key.
> >>
> >>
> >>
> >>
> >>
> >> -
> >> Liang-Chi Hsieh | @viirya
> >> Spark Technology Center
> >> http://www.spark.tc/
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/


Re: Aggregating over sorted data

2016-12-22 Thread trsell
I would love this feature

On Thu, 22 Dec 2016, 18:45 assaf.mendelson,  wrote:

> It seems that this aggregation is for dataset operations only. I would
> have hoped to be able to do dataframe aggregation. Something along the lines
> of: sort_df(df).agg(my_agg_func)
>
>
>
> In any case, note that this kind of sorting is less efficient than the
> sorting done in window functions for example. Specifically here what is
> happening is that first the data is shuffled and then the entire partition
> is sorted. It is possible to do it another way (although I have no idea how
> to do it in spark without writing a UDAF which is probably very
> inefficient). The other way would be to collect everything by key in each
> partition, sort within the key (which would be a lot faster since there are
> fewer elements) and then merge the results.
>
>
>
> I was hoping to find something like: Efficient sortByKey to work with…
>
>
>
> *From:* Koert Kuipers [via Apache Spark Developers List] 
> *Sent:* Thursday, December 22, 2016 7:14 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Aggregating over sorted data
>
>
>
> it can also be done with repartition + sortWithinPartitions +
> mapPartitions.
>
> perhaps not as convenient but it does not rely on undocumented behavior.
>
> i used this approach in spark-sorted. see here:
>
>
> https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala
>
> On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh wrote:
>
>
> I agree that to make this work, you might need to know the Spark-internal
> implementation of APIs such as `groupBy`.
>
> But without any further changes to the current Spark implementation, I think
> this is one possible way to achieve the required behavior of aggregating over
> sorted data per key.
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/


RE: Aggregating over sorted data

2016-12-22 Thread assaf.mendelson
It seems that this aggregation is for dataset operations only. I would have 
hoped to be able to do dataframe aggregation. Something along the lines of:
sort_df(df).agg(my_agg_func)

In any case, note that this kind of sorting is less efficient than the sorting 
done in window functions for example. Specifically here what is happening is 
that first the data is shuffled and then the entire partition is sorted. It is 
possible to do it another way (although I have no idea how to do it in spark 
without writing a UDAF which is probably very inefficient). The other way would 
be to collect everything by key in each partition, sort within the key (which 
would be a lot faster since there are fewer elements) and then merge the 
results.

I was hoping to find something like: Efficient sortByKey to work with…
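
For comparison, the window-function route I mentioned looks roughly like this
(the column names and the input `df` are made up for the example):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list}

// df is assumed to have columns "key", "time" and "value".
val w = Window.partitionBy("key").orderBy("time")
  .rowsBetween(Long.MinValue, Long.MaxValue)

// Per key, collect the values in time order; keep one row per key afterwards.
val ordered = df
  .withColumn("valuesInOrder", collect_list(col("value")).over(w))
  .dropDuplicates("key")
  .select("key", "valuesInOrder")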

From: Koert Kuipers [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n20332...@n3.nabble.com]
Sent: Thursday, December 22, 2016 7:14 AM
To: Mendelson, Assaf
Subject: Re: Aggregating over sorted data

it can also be done with repartition + sortWithinPartitions + mapPartitions.
perhaps not as convenient but it does not rely on undocumented behavior.
i used this approach in spark-sorted. see here:
https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala

On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh wrote:

I agree that to make this work, you might need to know the Spark-internal
implementation of APIs such as `groupBy`.

But without any further changes to the current Spark implementation, I think
this is one possible way to achieve the required behavior of aggregating over
sorted data per key.





-
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/




Re: Aggregating over sorted data

2016-12-22 Thread Liang-Chi Hsieh

You can't use existing aggregation functions with that. Besides, the
execution plan of `mapPartitions` doesn't support whole-stage codegen.
Without that and some of the optimizations around aggregation, there might be
a performance degradation. Also, when you have more than one key in a
partition, you will need to take care of that in the function you apply to
each partition.
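
To make the last point concrete, here is a rough sketch (with a made-up record
type) of handling several keys inside one sorted partition by watching for key
boundaries; it could be passed to `mapPartitions` after the repartition +
sortWithinPartitions step discussed in this thread:

// Assumes rows arrive sorted by key within the partition.
case class Rec(key: String, value: Double)

// Emits one (key, aggregate) pair per key; order-dependent logic could replace the sum.
def aggregateSortedPartition(iter: Iterator[Rec]): Iterator[(String, Double)] =
  new Iterator[(String, Double)] {
    private val buffered = iter.buffered
    def hasNext: Boolean = buffered.hasNext
    def next(): (String, Double) = {
      val key = buffered.head.key
      var sum = 0.0
      while (buffered.hasNext && buffered.head.key == key) {
        sum += buffered.next().value
      }
      (key, sum)
    }
  }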


Koert Kuipers wrote
> it can also be done with repartition + sortWithinPartitions +
> mapPartitions.
> perhaps not as convenient but it does not rely on undocumented behavior.
> i used this approach in spark-sorted. see here:
> https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala
> 
> On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh <viirya@> wrote:
> 
>>
>> I agree that to make this work, you might need to know the Spark-internal
>> implementation of APIs such as `groupBy`.
>>
>> But without any further changes to the current Spark implementation, I
>> think this is one possible way to achieve the required behavior of
>> aggregating over sorted data per key.
>>
>>
>>
>>
>>
>> -
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 