Re: Use cases around image/video processing in spark

2016-08-10 Thread Benjamin Fradet
Hi,

Check out the Thunder project.


On Wed, Aug 10, 2016 at 5:20 PM, Deepak Sharma 
wrote:

> Hi,
> Does anyone know of a GitHub repo that could help me get started with
> image and video processing using Spark?
> The images/videos will be stored in S3, and I am planning to use S3 with
> Spark.
> In this case, how will Spark achieve distributed processing?
> Any code base or references are really appreciated.
>
> --
> Thanks
> Deepak
>
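
A minimal sketch of distributed image loading from S3 (the bucket path is a
placeholder; assumes the S3 connector is configured): binaryFiles splits the
object listing into partitions, and each executor reads and decodes only the
files in its own partitions, which is how Spark distributes the work.

    import javax.imageio.ImageIO

    import org.apache.spark.{SparkConf, SparkContext}

    object S3ImageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("s3-images"))

        // RDD of (path, PortableDataStream), partitioned across executors.
        val images = sc.binaryFiles("s3n://my-bucket/images/")

        // Decode each image on the executor holding its partition.
        val dimensions = images.mapValues { stream =>
          val img = ImageIO.read(stream.open())
          (img.getWidth, img.getHeight)
        }

        dimensions.take(5).foreach(println)
        sc.stop()
      }
    }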



-- 
Ben Fradet.


Re: ml ALS.fit(..) issue

2016-07-22 Thread Benjamin Fradet
It looks like there is a Scala version incompatibility between your program
and the Scala version Spark was compiled against.
Either you're using Scala 2.11 and your Spark installation was built with
2.10, or the other way around.
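
For reference, a minimal build.sbt sketch (the version numbers are
illustrative): the %% operator appends the Scala binary suffix (_2.10 or
_2.11) to the artifact name, so the Spark artifacts stay consistent with
scalaVersion, and both must match the Scala version your Spark installation
was built with.

    // build.sbt
    scalaVersion := "2.10.6" // must match the Spark build's Scala version

    libraryDependencies ++= Seq(
      // %% resolves to spark-core_2.10 here, matching scalaVersion above
      "org.apache.spark" %% "spark-core"  % "2.0.0-preview" % "provided",
      "org.apache.spark" %% "spark-mllib" % "2.0.0-preview" % "provided"
    )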

On Fri, Jul 22, 2016 at 11:06 PM, Pedro Rodriguez 
wrote:

> The dev list is meant for working on development of Spark, not as a way of
> escalating an issue, just FYI.
>
> If someone hasn't replied on the user list, either you haven't given it
> enough time or no one has a fix for you. I've definitely gotten replies
> from committers multiple times to many questions, so it's definitely *not*
> the case that they don't care.
>
> On Fri, Jul 22, 2016 at 10:18 AM, VG  wrote:
>
>> Dev team,
>>
>> Can someone please help me here.
>>
>> -VG
>>
>> On Fri, Jul 22, 2016 at 8:30 PM, VG  wrote:
>>
>>> Can someone please help here.
>>>
>>> I tried both scala 2.10 and 2.11 on the system
>>>
>>>
>>>
>>> On Fri, Jul 22, 2016 at 7:59 PM, VG  wrote:
>>>
 I am using version 2.0.0-preview



 On Fri, Jul 22, 2016 at 7:47 PM, VG  wrote:

> I am running into the following error when running ALS
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
> at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452)
> at yelp.TestUser.main(TestUser.java:101)
>
> Here, line 101 in the above error corresponds to the following code:
>
> ALSModel model = als.fit(training);
>
>
> Does anyone have a suggestion as to what is going on here and where I
> might be going wrong?
> Please suggest.
>
> -VG
>


>>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


-- 
Ben Fradet.


Re: Spark Streaming KafkaUtils missing Save API?

2016-01-15 Thread Benjamin Fradet
There was a PR regarding this which was closed, but the author of the PR
created a spark-package: https://github.com/cloudera/spark-kafka-writer.

I don't know exactly why it was decided not to be incorporated into Spark,
however.
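
The usual workaround is to write from foreachPartition so that one producer
is created per partition rather than per record. A sketch (assuming
kafka-clients on the classpath; the broker address and topic are
placeholders):

    import java.util.Properties

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.rdd.RDD

    def saveToKafka(rdd: RDD[String], topic: String): Unit = {
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")

        // One producer per partition, created on the executor, not the driver.
        val producer = new KafkaProducer[String, String](props)
        records.foreach { r =>
          producer.send(new ProducerRecord[String, String](topic, r))
        }
        producer.close()
      }
    }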
On 15 Jan 2016 8:04 p.m., "Renyi Xiong"  wrote:

> Hi,
>
> We noticed there's no save method in KafkaUtils. We do have scenarios
> where we want to save an RDD back to a Kafka queue to be consumed by
> downstream streaming applications.
>
> I wonder if this is a common scenario, if yes, any plan to add it?
>
> Thanks,
> Renyi.
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Benjamin Fradet
+1
On 22 Dec 2015 9:54 p.m., "Andrew Or"  wrote:

> +1
>
> 2015-12-22 12:43 GMT-08:00 Reynold Xin :
>
>> +1
>>
>>
>> On Tue, Dec 22, 2015 at 12:29 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> I'll kick the voting off with a +1.
>>>
>>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 1.6.0!

 The vote is open until Friday, December 25, 2015 at 18:00 UTC and
 passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.6.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v1.6.0-rc4
 (4062cda3087ae42c6c3cb24508fc1d3a931accdf)

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1176/

 The test repository (versioned as v1.6.0-rc4) for this release can be
 found at:
 https://repository.apache.org/content/repositories/orgapachespark-1175/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/

 ===
 == How can I help test this release? ==
 ===
 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload, running it on this release candidate, and then
 reporting any regressions.
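
 One way to do this, sketched here for an sbt build (hypothetical; adapt to
 your own workload), is to resolve the staged artifacts directly and rebuild
 against them:

    resolvers += "apache-spark-1.6.0-rc4-staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1176/"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"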

 
 == What justifies a -1 vote for this release? ==
 
 This vote is happening towards the end of the 1.6 QA period, so -1
 votes should only occur for significant regressions from 1.5. Bugs already
 present in 1.5, minor regressions, or bugs related to new features will not
 block this release.

 ===
 == What should happen to JIRA tickets still targeting 1.6.0? ==
 ===
 1. It is OK for documentation patches to target 1.6.0 and still go into
 branch-1.6, since documentation will be published separately from the
 release.
 2. New features for non-alpha-modules should target 1.7+.
 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
 target version.


 ==
 == Major changes to help you focus your testing ==
 ==

 Notable changes since 1.6 RC3

   - SPARK-12404 - Fix serialization error for Datasets with
   Timestamps/Arrays/Decimal
   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
   - SPARK-12395 - Fix join columns of outer join for DataFrame using
   - SPARK-12413 - Fix mesos HA

 Notable changes since 1.6 RC2

   - SPARK_VERSION has been set correctly
   - SPARK-12199 ML Docs are publishing correctly
   - SPARK-12345 Mesos cluster mode has been fixed

 Notable changes since 1.6 RC1

 Spark Streaming

   - SPARK-2629 trackStateByKey has been renamed to mapWithState

 Spark SQL

   - SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
   execution.
   - SPARK-12258 Correct passing null into ScalaUDF

 Notable Features Since 1.5

 Spark SQL

   - SPARK-11787 Parquet Performance - Improve Parquet scan performance
   when using flat schemas.
   - SPARK-10810 Session Management - Isolated default database (i.e. USE
   mydb) even on shared clusters.
   - SPARK- Dataset API - A type-safe API (similar to RDDs) that performs
   many operations on serialized binary data and code generation (i.e.
   Project Tungsten).
   - SPARK-1 Unified Memory Management - Shared memory for execution and
   caching instead of exclusive division of the regions.

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Benjamin Fradet
-1

For me the docs are not displaying except for the first page; for example,
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/mllib-guide.html
is a blank page.
This is because of SPARK-12199:
Element[W|w]iseProductExample.scala is not the same in the docs and the
actual file name.

On Sat, Dec 12, 2015 at 6:39 PM, Michael Armbrust 
wrote:

> I'll kick off the voting with a +1.
>
> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v1.6.0-rc2
>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>
>> The test repository (versioned as v1.6.0-rc2) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>
>> ===
>> == How can I help test this release? ==
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload, running it on this release candidate, and then
>> reporting any regressions.
>>
>> 
>> == What justifies a -1 vote for this release? ==
>> 
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentation will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==
>> == Major changes to help you focus your testing ==
>> ==
>>
>> Spark 1.6.0 Preview
>>
>> Notable changes since 1.6 RC1
>>
>> Spark Streaming
>>
>>    - SPARK-2629 trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>>    - SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
>>    execution.
>>    - SPARK-12258 Correct passing null into ScalaUDF
>>
>> Notable Features Since 1.5
>>
>> Spark SQL
>>
>>    - SPARK-11787 Parquet Performance - Improve Parquet scan performance
>>    when using flat schemas.
>>    - SPARK-10810 Session Management - Isolated default database (i.e.
>>    USE mydb) even on shared clusters.
>>    - SPARK- Dataset API - A type-safe API (similar to RDDs) that
>>    performs many operations on serialized binary data and code
>>    generation (i.e. Project Tungsten).
>>    - SPARK-1 Unified Memory Management - Shared memory for execution
>>    and caching instead of exclusive division of the regions.
>>    - SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
>>    queries over files of any supported format without registering a
>>    table.
>>    - SPARK-11745 Reading non-standard JSON files - Added options to
>>    read non-standard JSON files (e.g. single-quotes, unquoted
>>    attributes)
>>    -

[ML] Missing documentation for the IndexToString feature transformer

2015-12-05 Thread Benjamin Fradet
Hi,

I was wondering why the IndexToString label transformer is not documented
in ml-features.md.

If it's not intentional, having used it a few times, I'd be happy to
submit a JIRA and the associated PR.
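
For context, a minimal usage sketch (the column names and the `df`
DataFrame are illustrative): StringIndexer encodes string labels as
indices, and IndexToString maps indices back to the original labels,
typically applied to a model's prediction column.

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    // Fit an indexer so we know the label <-> index mapping.
    val indexerModel = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)

    // Map the indices back to the original string labels.
    val converter = new IndexToString()
      .setInputCol("categoryIndex")
      .setOutputCol("originalCategory")
      .setLabels(indexerModel.labels)

    val converted = converter.transform(indexerModel.transform(df))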

Best,
Ben.

-- 
Ben Fradet.


Re: Grid search with Random Forest

2015-11-30 Thread Benjamin Fradet
Hi Ndjido,

This is because GBTClassifier doesn't yet have a rawPredictionCol like the
RandomForestClassifier has.
Cf:
http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
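
For reference, a minimal sketch of such a grid search (Spark 1.5-era ML
API; the parameter values and the `training` DataFrame are illustrative).
It works with RandomForestClassifier because BinaryClassificationEvaluator
reads the rawPrediction column, which GBTClassifier did not produce at the
time:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val rf = new RandomForestClassifier()

    val grid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(20, 50))
      .addGrid(rf.maxDepth, Array(4, 8))
      .build()

    val cv = new CrossValidator()
      .setEstimator(rf)
      .setEvaluator(new BinaryClassificationEvaluator()) // reads rawPredictionCol
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    val model = cv.fit(training)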
On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR"  wrote:

> Hi Joseph,
>
> Yes, Random Forest supports Grid Search on Spark 1.5+. But I'm getting a
> "rawPredictionCol field does not exist" exception on Spark 1.5.2 for the
> Gradient Boosting Trees classifier.
>
>
> Ardo
> On Tue, 1 Dec 2015 at 01:34, Joseph Bradley  wrote:
>
>> It should work with 1.5+.
>>
>> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar 
>> wrote:
>>
>>>
>>> Hi folks,
>>>
>>> Does anyone know whether the Grid Search capability is enabled since
>>> issue SPARK-9011 of version 1.4.0? I'm getting the "rawPredictionCol
>>> column doesn't exist" error when trying to perform a grid search with
>>> Spark 1.4.0.
>>>
>>> Cheers,
>>> Ardo
>>>
>>>
>>>
>>>
>>


Re: Unhandled case in VectorAssembler

2015-11-21 Thread Benjamin Fradet
Will do, thanks for your input.
On 21 Nov 2015 2:42 a.m., "Joseph Bradley"  wrote:

> Yes, please, could you send a JIRA (and PR)?  A custom error message would
> be better.
> Thank you!
> Joseph
>
> On Fri, Nov 20, 2015 at 2:39 PM, BenFradet 
> wrote:
>
>> Hey there,
>>
>> I noticed that there is an unhandled case in the transform method of
>> VectorAssembler if one of the input columns doesn't have one of the
>> supported types: DoubleType, NumericType, BooleanType, or VectorUDT.
>>
>> So, if you try to transform a column of StringType, you get a cryptic
>> "scala.MatchError: StringType".
>> I was wondering if we shouldn't throw a custom exception indicating that
>> this is not a supported type.
>>
>> I can submit a JIRA and PR if needed.
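>>
>> A sketch of the shape such a check could take (hypothetical; it mirrors
>> the type dispatch described above, not the actual VectorAssembler
>> source, and the helper name is invented):
>>
>>     import org.apache.spark.mllib.linalg.VectorUDT
>>     import org.apache.spark.sql.types._
>>
>>     // Unsupported input types raise a descriptive error instead of a
>>     // bare scala.MatchError.
>>     def validateInputType(field: StructField): Unit = field.dataType match {
>>       case DoubleType | BooleanType => // supported as-is
>>       case _: NumericType           => // supported, cast to Double
>>       case _: VectorUDT             => // supported, expanded elementwise
>>       case other => throw new IllegalArgumentException(
>>         s"VectorAssembler does not support column '${field.name}' of type $other")
>>     }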
>>
>> Best regards,
>> Ben.
>>
>>
>>
>