Re: Use Spark extension points to implement row-level security

2018-08-18 Thread Richard Siebeling
Thanks, this looks promising. I am trying to do it without a dependency on
Hive and was hoping that the extension hooks could be used to add a filter
transformation to the logical plan. I've seen another email saying that in
the optimisation hook the logical plan is expected to stay the same (
https://stackoverflow.com/questions/40235566/transforming-spark-sql-ast-with-extraoptimizations/40273936
)

But I'm still hoping that some other extension hook can be used to add the
filter operation. Does anyone know whether that is possible?

There is not much documentation on the extension hooks; I could not figure
it out from the existing documentation.

Regards,
Richard


On Fri, 17 Aug 2018 at 15:33, Maximiliano Patricio Méndez <
mmen...@despegar.com> wrote:

> Hi,
>
> I've added table level security using spark extensions based on the
> ongoing work proposed for ranger in RANGER-2128. Following the same logic,
> you could mask columns and work on the logical plan, but not filtering or
> skipping rows, as those are not present in these hooks.
>
> The only difficulty I found was integrating extensions with pyspark, since
> in python the SparkContext is always created through the constructor and
> not using the scala getOrCreate() method (I've sent an email regarding
> this). But other than that, it works.
>
>
> On Fri, Aug 17, 2018, 03:56 Richard Siebeling 
> wrote:
>
>> Hi,
>>
>> I'd like to implement some kind of row-level security and am thinking of
>> adding additional filters to the logical plan possibly using the Spark
>> extensions.
>> Would this be feasible, for example using the injectResolutionRule?
>>
>> thanks in advance,
>> Richard
>>
>


Use Spark extension points to implement row-level security

2018-08-16 Thread Richard Siebeling
Hi,

I'd like to implement some kind of row-level security and am thinking of
adding additional filters to the logical plan possibly using the Spark
extensions.
Would this be feasible, for example using the injectResolutionRule?

thanks in advance,
Richard
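
For reference, a minimal sketch of how the extension hooks are wired up. The
rule body is left as a no-op placeholder; whether a row filter can really be
added in these hooks is exactly the open question of this thread, and the
class and group names are hypothetical:

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Analyzer rule injected through the extension points. A real row-level
// security rule would pattern-match the relation nodes of the protected
// tables and wrap them in Filter(condition, relation); it must also be
// idempotent, because analyzer rules are re-applied until a fixed point.
case class RowFilterRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan  // placeholder
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions { extensions: SparkSessionExtensions =>
    extensions.injectResolutionRule(RowFilterRule)  // analysis-time hook
    // extensions.injectOptimizerRule(...)          // optimizer-time hook
  }
  .getOrCreate()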


Determine Cook's distance / influential data points

2017-12-13 Thread Richard Siebeling
Hi,

would it be possible to determine the Cook's distance using Spark?
thanks,
Richard
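
As far as I know there is no built-in Cook's distance in Spark ML, but for an
ordinary least-squares fit it can be assembled from the residuals and the
leverages. A rough, untested sketch, assuming a DataFrame with "features" and
"label" columns and few enough features that (X^T X)^{-1} can be inverted on
the driver:

import breeze.linalg.{inv, DenseVector => BDV}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def cooksDistance(df: DataFrame): DataFrame = {
  val spark = df.sparkSession
  import spark.implicits._

  // Fit without intercept so the design matrix equals the feature vectors;
  // for an intercept model, append a constant 1.0 to every feature vector.
  val model = new LinearRegression().setFitIntercept(false).fit(df)
  val p = model.coefficients.size.toDouble            // number of parameters
  val n = df.count()

  // Residuals e_i = y_i - y_hat_i
  val withResidual = model.transform(df)
    .withColumn("residual", $"label" - $"prediction")

  // (X^T X)^{-1}: sum of outer products, inverted on the driver, broadcast
  val xtx = withResidual.select("features").rdd
    .map { row => val x = BDV(row.getAs[Vector](0).toArray); x * x.t }
    .reduce(_ + _)
  val xtxInv = spark.sparkContext.broadcast(inv(xtx))

  // Residual variance s^2 = RSS / (n - p)
  val rss = withResidual.select(sum($"residual" * $"residual")).as[Double].first()
  val s2 = rss / (n - p)

  // Leverage h_ii = x_i^T (X^T X)^{-1} x_i and
  // Cook's distance D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2)
  val leverage = udf { v: Vector =>
    val x = BDV(v.toArray); x.t * (xtxInv.value * x)
  }
  withResidual
    .withColumn("leverage", leverage($"features"))
    .withColumn("cooks_distance",
      ($"residual" * $"residual" * $"leverage") /
        (lit(p) * lit(s2) * (lit(1.0) - $"leverage") * (lit(1.0) - $"leverage")))
}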


Re: Handling skewed data

2017-04-19 Thread Richard Siebeling
I'm also interested in this, does anyone know?

On 17 April 2017 at 17:17, Vishnu Viswanath 
wrote:

> Hello All,
>
> Does anyone know if the skew handling code mentioned in this talk
> https://www.youtube.com/watch?v=bhYV0JOPd9Y was added to spark?
>
> If so can I know where to look for more info, JIRA? Pull request?
>
> Thanks in advance.
> Regards,
> Vishnu Viswanath.
>
>
>
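
For what it's worth, until something like that lands in Spark itself, a common
workaround is to salt the skewed join key so the hot keys are spread over
several partitions. A hedged sketch (the DataFrames, the join column and the
bucket count are placeholders):

import org.apache.spark.sql.functions._

val SALT_BUCKETS = 16

// Spread the skewed (large) side over SALT_BUCKETS sub-keys per key...
val saltedLarge = large.withColumn("salt", (rand() * SALT_BUCKETS).cast("int"))

// ...and replicate the small side once per salt value.
val saltedSmall = small.withColumn("salt",
  explode(array((0 until SALT_BUCKETS).map(lit): _*)))

val joined = saltedLarge
  .join(saltedSmall, Seq("key", "salt"))
  .drop("salt")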


Re: Fast write datastore...

2017-03-15 Thread Richard Siebeling
maybe Apache Ignite fits your requirements

On 15 March 2017 at 08:44, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Hi
> If queries are static and filters are on the same columns, Cassandra is a
> good option.
>
> On 15 Mar 2017 at 7:04 AM, "muthu"  wrote:
>
> Hello there,
>
> I have one or more parquet files to read and perform some aggregate queries
> using Spark Dataframe. I would like to find a reasonably fast datastore
> that
> allows me to write the results for subsequent (simpler) queries.
> I did attempt to use ElasticSearch to write the query results using
> ElasticSearch Hadoop connector. But I am running into connector write
> issues
> if the number of Spark executors is too many for ElasticSearch to handle.
> But in the schema sense, this seems a great fit as ElasticSearch has smarts
> in place to discover the schema. Also in the query sense, I can perform
> simple filters and sort using ElasticSearch and for more complex aggregate,
> Spark Dataframe can come back to the rescue :).
> Please advise on other possible data-stores I could use.
>
> Thanks,
> Muthu
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: Continuous or Categorical

2017-03-01 Thread Richard Siebeling
I think it's difficult to determine with certainty whether a variable is
continuous or categorical. What should be done when the values are numbers
like 1, 2, 2, 3, 4, 5? These values could be either continuous or
categorical.
However, you could perform some checks:
- if there are any decimal values, it is probably continuous
- if the values are strings, it is categorical

There are more tests possible, but it depends on what you know about the
data...
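
A rough sketch of those checks against a DataFrame column (heuristics only,
not a statistical test; the function name and verdicts are placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

def guessVariableType(df: DataFrame, colName: String): String =
  df.schema(colName).dataType match {
    case StringType | BooleanType => "categorical"
    case DoubleType | FloatType | (_: DecimalType) =>
      // numeric type: if any value has a fractional part, it is probably continuous
      val hasDecimals =
        df.filter(col(colName) =!= col(colName).cast("long")).limit(1).count() > 0
      if (hasDecimals) "continuous" else "unknown (integer-valued numbers)"
    case IntegerType | LongType | ShortType | ByteType =>
      "unknown (could be either)"
    case _ => "unknown"
  }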

On 1 March 2017 at 15:36, Madabhattula Rajesh Kumar 
wrote:

> Hi,
>
> How to check given a set of values(example:- Column values in CSV file)
> are Continuous or Categorical? Any statistical test is available?
>
> Regards,
> Rajesh
>


Re: is it possible to read .mdb file in spark

2017-01-26 Thread Richard Siebeling
Hi,

haven't used it, but Jackcess should do the trick >
http://jackcess.sourceforge.net/
kind regards,
Richard
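
In case it helps, an untested sketch of what that could look like: read the
table with Jackcess on the driver and parallelize it into a DataFrame. Only
sensible for .mdb files small enough to read on a single machine; the file,
table and column names are placeholders and `spark` is an existing
SparkSession:

import java.io.File
import scala.collection.JavaConverters._
import com.healthmarketscience.jackcess.DatabaseBuilder

val db = DatabaseBuilder.open(new File("/path/to/data.mdb"))
val table = db.getTable("Customers")

// Each Jackcess Row behaves like a java.util.Map[String, AnyRef]
val rows = table.asScala.map { row =>
  (String.valueOf(row.get("Id")), String.valueOf(row.get("Name")))
}.toSeq

val df = spark.createDataFrame(rows).toDF("id", "name")
db.close()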

2017-01-25 11:47 GMT+01:00 Selvam Raman :

>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Richard Siebeling
Probably found it: it turns out that Mesos support should be explicitly added
while building Spark. I assumed I could use the old build command that I used
for building Spark 2.0.0 and didn't see the two lines added in the
documentation...

Maybe these kinds of changes could be added to the changelog under changes
of behaviour or changes in the build process or something like that,

kind regards,
Richard
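
For the archives: the Spark 2.1 build documentation lists a separate Mesos
profile, so the build command presumably becomes something like:

./build/mvn -Pmesos -Pyarn -Phadoop-2.7 -DskipTests clean package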


On 9 January 2017 at 22:55, Richard Siebeling  wrote:

> Hi,
>
> I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not
> parse Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.
> Mesos is running fine (both the master and the slave; it's a single-machine
> configuration).
>
> I really don't understand why this is happening, since the same
> configuration with Spark 2.0.0 runs fine within Vagrant.
> Could someone please help?
>
> thanks in advance,
> Richard
>
>
>
>


Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Richard Siebeling
Hi,

I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not
parse Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.
Mesos is running fine (both the master and the slave; it's a single-machine
configuration).

I really don't understand why this is happening, since the same
configuration with Spark 2.0.0 runs fine within Vagrant.
Could someone please help?

thanks in advance,
Richard


Re: How to stop a running job

2016-10-06 Thread Richard Siebeling
I think I mean the job that Mark is talking about, but that's also the thing
that's being stopped by the dcos command and (hopefully) the thing that's
being stopped by the dispatcher, isn't it?

It would be really good if the issue (SPARK-17064) were resolved, but for
now I'll make do with cancelling the planned tasks in the current job
(that's already a lot better than completing the whole job).
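
For the in-application part (cancelling the planned tasks of the first job),
a minimal sketch using job groups; `rdd` and `expensiveTransformation` are
placeholders, and the cancel call would normally come from another thread,
e.g. the one handling the corrected submission:

// Tag the work with a job group; interruptOnCancel = true also asks Spark to
// interrupt tasks that are already running on the executors.
sc.setJobGroup("user-query-1", "first long-running query", interruptOnCancel = true)
val result = rdd.map(expensiveTransformation).collect()   // runs under the group

// Later, when the user submits the corrected job (from another thread):
sc.cancelJobGroup("user-query-1")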

Thanks anyway for the answers, you helped me a lot,
kind regards,
Richard

On Wed, Oct 5, 2016 at 11:38 PM, Michael Gummelt 
wrote:

> You're using the proper Spark definition of "job", but I believe Richard
> means "driver".
>
> On Wed, Oct 5, 2016 at 2:17 PM, Mark Hamstra 
> wrote:
>
>> Yes and no.  Something that you need to be aware of is that a Job as such
>> exists in the DAGScheduler as part of the Application running on the
>> Driver.  When talking about stopping or killing a Job, however, what people
>> often mean is not just stopping the DAGScheduler from telling the Executors
>> to run more Tasks associated with the Job, but also to stop any associated
>> Tasks that are already running on Executors.  That is something that Spark
>> doesn't try to do by default, and changing that behavior has been an open
>> issue for a long time -- cf. SPARK-17064
>>
>> On Wed, Oct 5, 2016 at 2:07 PM, Michael Gummelt 
>> wrote:
>>
>>> If running in client mode, just kill the job.  If running in cluster
>>> mode, the Spark Dispatcher exposes an HTTP API for killing jobs.  I don't
>>> think this is externally documented, so you might have to check the code to
>>> find this endpoint.  If you run in dcos, you can just run "dcos spark kill
>>> ".
>>>
>>> You can also find which node is running the driver, ssh in, and kill the
>>> process.
>>>
>>> On Wed, Oct 5, 2016 at 1:55 PM, Richard Siebeling 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> how can I stop a long running job?
>>>>
>>>> We're having Spark running in Mesos Coarse-grained mode. Suppose the
>>>> user starts a long-running job, makes a mistake, changes a transformation
>>>> and runs the job again. In this case I'd like to cancel the first job and
>>>> after that start the second job. It would be a waste of resources to finish
>>>> the first job (which could possibly take several hours...)
>>>>
>>>> How can this be accomplished?
>>>> thanks in advance,
>>>> Richard
>>>>
>>>>
>>>
>>>
>>> --
>>> Michael Gummelt
>>> Software Engineer
>>> Mesosphere
>>>
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


How to stop a running job

2016-10-05 Thread Richard Siebeling
Hi,

how can I stop a long running job?

We're having Spark running in Mesos Coarse-grained mode. Suppose the user
starts a long-running job, makes a mistake, changes a transformation and
runs the job again. In this case I'd like to cancel the first job and after
that start the second job. It would be a waste of resources to finish the
first job (which could possibly take several hours...)

How can this be accomplished?
thanks in advance,
Richard


Re: building Spark 2.1 vs Java 1.8 on Ubuntu 16/06

2016-10-05 Thread Richard Siebeling
sorry, now with the link included, see
http://spark.apache.org/docs/latest/building-spark.html

On Wed, Oct 5, 2016 at 10:19 AM, Richard Siebeling 
wrote:

> Hi,
>
> did you set the following option: export MAVEN_OPTS="-Xmx2g
> -XX:ReservedCodeCacheSize=512m"
>
> kind regards,
> Richard
>
> On Tue, Oct 4, 2016 at 10:21 PM, Marco Mistroni 
> wrote:
>
>> Hi all
>>  my mvn build of Spark 2.1 using Java 1.8 is running out of memory with
>> an error saying it cannot allocate enough memory during Maven compilation.
>>
>> Instructions (on the Spark 2.0 page) say that MAVEN_OPTS is not needed
>> for Java 1.8 and, according to my understanding, the Spark build process
>> will add it
>> during the build via mvn.
>> Note: I am not using Zinc. Rather, I am using my own Maven version
>> (3.3.9), launching this command from the main Spark directory. The same
>> build works when I use Java 1.7 (and MAVEN_OPTS).
>>
>> mvn -Pyarn -Dscala-2.11 -DskipTests clean package
>>
>> Could anyone assist?
>> kr
>>   marco
>>
>
>


Re: building Spark 2.1 vs Java 1.8 on Ubuntu 16/06

2016-10-05 Thread Richard Siebeling
Hi,

did you set the following option: export MAVEN_OPTS="-Xmx2g
-XX:ReservedCodeCacheSize=512m"

kind regards,
Richard

On Tue, Oct 4, 2016 at 10:21 PM, Marco Mistroni  wrote:

> Hi all
>  my mvn build of Spark 2.1 using Java 1.8 is running out of memory with
> an error saying it cannot allocate enough memory during Maven compilation.
>
> Instructions (on the Spark 2.0 page) say that MAVEN_OPTS is not needed
> for Java 1.8 and, according to my understanding, the Spark build process
> will add it
> during the build via mvn.
> Note: I am not using Zinc. Rather, I am using my own Maven version
> (3.3.9), launching this command from the main Spark directory. The same
> build works when I use Java 1.7 (and MAVEN_OPTS).
>
> mvn -Pyarn -Dscala-2.11 -DskipTests clean package
>
> Could anyone assist?
> kr
>   marco
>


Re: Best way to calculate intermediate column statistics

2016-08-25 Thread Richard Siebeling
Hi Mich,

thanks for the suggestion, I hadn't thought of that. We'll need to gather
the statistics in two ways: incrementally when new data arrives, and over the
complete set when aggregating or filtering (because I think it's difficult
to gather statistics while aggregating or filtering).
The analytic functions could help when gathering the statistics over the
whole set,

kind regards,
Richard
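
For the statistics over the whole set, a minimal sketch of a single wide
aggregation that collects an empty-value count and a distinct count per
column in one pass (`df` is a placeholder for the filtered or aggregated
result):

import org.apache.spark.sql.functions._

// One expression pair per column; "empty" here means null or empty string.
val statExprs = df.columns.flatMap { c =>
  Seq(
    count(when(col(c).isNull || col(c) === "", 1)).as(s"${c}_empty"),
    countDistinct(col(c)).as(s"${c}_distinct")   // or an approximate variant
  )
}
val columnStats = df.agg(statExprs.head, statExprs.tail: _*)
columnStats.show()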



On Wed, Aug 24, 2016 at 10:54 PM, Mich Talebzadeh  wrote:

> Hi Richard,
>
> can you use analytics functions for this purpose on DF
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 August 2016 at 21:37, Richard Siebeling 
> wrote:
>
>> Hi Mich,
>>
>> I'd like to gather several statistics per column in order to make
>> analysing data easier. These two statistics are some examples, other
>> statistics I'd like to gather are the variance, the median, several
>> percentiles, etc.  We are building a data analysis platform based on Spark,
>>
>> kind regards,
>> Richard
>>
>> On Wed, Aug 24, 2016 at 6:52 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Richard,
>>>
>>> What is the business use case for such statistics?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 August 2016 at 16:01, Bedrytski Aliaksandr 
>>> wrote:
>>>
>>>> Hi Richard,
>>>>
>>>> these intermediate statistics should be calculated from the result of
>>>> the calculation or during the aggregation?
>>>> If they can be derived from the resulting dataframe, why not cache
>>>> (persist) that result just after the calculation?
>>>> Then you may aggregate statistics from the cached dataframe.
>>>> This way it won't hit performance too much.
>>>>
>>>> Regards
>>>> --
>>>>   Bedrytski Aliaksandr
>>>>   sp...@bedryt.ski
>>>>
>>>>
>>>>
>>>> On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote:
>>>>
>>>> Hi,
>>>>
>>>> what is the best way to calculate intermediate column statistics, like
>>>> the number of empty values and the number of distinct values of each column
>>>> in a dataset, when aggregating or filtering data, next to the actual result
>>>> of the aggregate or the filtered data?
>>>>
>>>> We are developing an application in which the user can slice-and-dice
>>>> through the data and we would like to, next to the actual resulting data,
>>>> get column statistics of each column in the resulting dataset. We prefer to
>>>> calculate the column statistics on the same pass over the data as the
>>>> actual aggregation or filtering, is that possible?
>>>>
>>>> We could sacrifice a little bit of performance (but not too much),
>>>> that's why we prefer one pass...
>>>>
>>>> Is this possible in the standard Spark or would this mean modifying the
>>>> source a little bit and recompiling? Is that feasible / wise to do?
>>>>
>>>> thanks in advance,
>>>> Richard
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>


Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Richard Siebeling
Hi Mich,

I'd like to gather several statistics per column in order to make analysing
data easier. These two statistics are some examples, other statistics I'd
like to gather are the variance, the median, several percentiles, etc.  We
are building a data analysis platform based on Spark,

kind regards,
Richard

On Wed, Aug 24, 2016 at 6:52 PM, Mich Talebzadeh 
wrote:

> Hi Richard,
>
> What is the business use case for such statistics?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 August 2016 at 16:01, Bedrytski Aliaksandr  wrote:
>
>> Hi Richard,
>>
>> these intermediate statistics should be calculated from the result of the
>> calculation or during the aggregation?
>> If they can be derived from the resulting dataframe, why not cache
>> (persist) that result just after the calculation?
>> Then you may aggregate statistics from the cached dataframe.
>> This way it won't hit performance too much.
>>
>> Regards
>> --
>>   Bedrytski Aliaksandr
>>   sp...@bedryt.ski
>>
>>
>>
>> On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote:
>>
>> Hi,
>>
>> what is the best way to calculate intermediate column statistics, like the
>> number of empty values and the number of distinct values of each column in a
>> dataset, when aggregating or filtering data, next to the actual result of the
>> aggregate or the filtered data?
>>
>> We are developing an application in which the user can slice-and-dice
>> through the data and we would like to, next to the actual resulting data,
>> get column statistics of each column in the resulting dataset. We prefer to
>> calculate the column statistics on the same pass over the data as the
>> actual aggregation or filtering, is that possible?
>>
>> We could sacrifice a little bit of performance (but not too much), that's
>> why we prefer one pass...
>>
>> Is this possible in the standard Spark or would this mean modifying the
>> source a little bit and recompiling? Is that feasible / wise to do?
>>
>> thanks in advance,
>> Richard
>>
>>
>>
>>
>>
>>
>


Best way to calculate intermediate column statistics

2016-08-24 Thread Richard Siebeling
Hi,

what is the best way to calculate intermediate column statistics, like the
number of empty values and the number of distinct values of each column in a
dataset, when aggregating or filtering data, next to the actual result of the
aggregate or the filtered data?

We are developing an application in which the user can slice-and-dice
through the data and we would like to, next to the actual resulting data,
get column statistics of each column in the resulting dataset. We prefer to
calculate the column statistics on the same pass over the data as the
actual aggregation or filtering, is that possible?

We could sacrifice a little bit of performance (but not too much), that's
why we prefer one pass...

Is this possible in the standard Spark or would this mean modifying the
source a little bit and recompiling? Is that feasible / wise to do?

thanks in advance,
Richard


Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
Fixed! After adding the option -DskipTests everything built OK.
Thanks, Sean, for your help.
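
For the archives, the full command that worked was presumably:

./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0-mapr-1602 -DskipTests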

On Thu, Aug 4, 2016 at 8:18 PM, Richard Siebeling 
wrote:

> I don't see any other errors, these are the last lines of the
> make-distribution log.
> Above these lines there are no errors...
>
>
> [INFO] Building jar: /opt/mapr/spark/spark-2.0.0/
> common/network-yarn/target/spark-network-yarn_2.11-2.0.0-test-sources.jar
> [warn] /opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/
> apache/spark/api/python/PythonRDD.scala:78: class Accumulator in package
> spark is deprecated: use AccumulatorV2
> [warn] accumulator: Accumulator[JList[Array[Byte]]])
> [warn]  ^
> [warn] /opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/
> apache/spark/api/python/PythonRDD.scala:71: class Accumulator in package
> spark is deprecated: use AccumulatorV2
> [warn] private[spark] case class PythonFunction(
> [warn]   ^
> [warn] /opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/
> apache/spark/api/python/PythonRDD.scala:873: trait AccumulatorParam in
> package spark is deprecated: use AccumulatorV2
> [warn]   extends AccumulatorParam[JList[Array[Byte]]] {
> [warn]   ^
> [warn] /opt/mapr/spark/spark-2.0.0/core/src/main/scala/org/
> apache/spark/util/AccumulatorV2.scala:459: trait AccumulableParam in
> package spark is deprecated: use AccumulatorV2
> [warn] param: org.apache.spark.AccumulableParam[R, T]) extends
> AccumulatorV2[T, R] {
> [warn] ^
> [warn] four warnings found
> [error] warning: [options] bootstrap class path not set in conjunction
> with -source 1.7
> [error] Compile failed at Aug 3, 2016 2:13:07 AM [1:12.769s]
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [
>  3.850 s]
> [INFO] Spark Project Tags . SUCCESS [
>  6.053 s]
> [INFO] Spark Project Sketch ... SUCCESS [
>  9.977 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 17.696 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  8.864 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 17.485 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 19.551 s]
> [INFO] Spark Project Core . FAILURE
> [01:19 min]
> [INFO] Spark Project GraphX ... SKIPPED
> [INFO] Spark Project Streaming  SKIPPED
> [INFO] Spark Project Catalyst . SKIPPED
> [INFO] Spark Project SQL .. SKIPPED
> [INFO] Spark Project ML Local Library . SUCCESS [
> 19.594 s]
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SUCCESS [
>  6.972 s]
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
> 12.019 s]
> [INFO] Spark Project YARN . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Project External Flume Sink .. SUCCESS [
> 13.460 s]
> [INFO] Spark Project External Flume ... SKIPPED
> [INFO] Spark Project External Flume Assembly .. SKIPPED
> [INFO] Spark Integration for Kafka 0.8  SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
> [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 02:08 min (Wall Clock)
> [INFO] Finished at: 2016-08-03T02:13:07+02:00
> [INFO] Final Memory: 54M/844M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-core_2.11: Execution
> scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> failed. CompileFailed -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-ru

Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Richard Siebeling
On Thu, Aug 4, 2016, 07:09 Richard Siebeling  wrote:
>
>> Hi,
>>
>> spark 2.0 with mapr hadoop libraries was successfully built using the
>> following command:
>> ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0-mapr-1602
>> -DskipTests clean package
>>
>> However when I then try to build a runnable distribution using the
>> following command
>> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.7
>> -Dhadoop.version=2.7.0-mapr-1602
>>
>> It fails with the error "bootstrap class path not set in conjunction
>> with -source 1.7"
>> Could you please help? I do not know what this error means,
>>
>> thanks in advance,
>> Richard
>>
>>
>>


Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-03 Thread Richard Siebeling
Hi,

spark 2.0 with mapr hadoop libraries was successfully built using the
following command:
./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0-mapr-1602
-DskipTests clean package

However when I then try to build a runnable distribution using the
following command
./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.7
-Dhadoop.version=2.7.0-mapr-1602

It fails with the error "bootstrap class path not set in conjunction with
-source 1.7"
Could you please help? I do not know what this error means,

thanks in advance,
Richard


Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
Well, the task itself is completed (it indeed gives a result), but the tasks
in Mesos say "killed" and it gives an error: "Remote RPC client
disassociated. Likely due to containers exceeding thresholds, or network
issues."

Kind regards,
Richard

On Monday, 16 May 2016, Jacek Laskowski > wrote:

> On Sun, May 15, 2016 at 5:50 PM, Richard Siebeling 
> wrote:
>
> > I'm getting the following errors running SparkPi on a clean just compiled
> > and checked Mesos 0.29.0 installation with Spark 1.6.1
> >
> > 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
> > e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx Remote RPC client
> > disassociated. Likely due to containers exceeding thresholds, or network
> > issues. Check driver logs for WARN messages.
>
> Looking at it again and I don't see an issue here? Why do you think it
> doesn't work for you? After Pi is calculated, the executors were taken
> down since the driver finished calculation (and closed SparkContext).
>
> Jacek
>


Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
B.t.w. this is on a single node cluster

On Sunday, 15 May 2016, Richard Siebeling  wrote:

> Hi,
>
> I'm getting the following errors running SparkPi on a clean just compiled
> and checked Mesos 0.29.0 installation with Spark 1.6.1
>
> 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx Remote RPC client
> disassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.
>
> The Mesos examples are running fine, only the SparkPi example isn't...
> I'm not sure what to do, I thought it had to do with the installation so I
> installed and compiled everything again, but without any good results.
>
> Please help,
> thanks in advance,
> Richard
>
>
> The complete logs are
>
> sudo ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> mesos://192.168.33.10:5050 --deploy-mode client ./lib/spark-examples* 10
>
> 16/05/15 23:05:36 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> I0515 23:05:38.393546 10915 sched.cpp:224] Version: 0.29.0
>
> I0515 23:05:38.402220 10909 sched.cpp:328] New master detected at
> master@192.168.33.10:5050
>
> I0515 23:05:38.403033 10909 sched.cpp:338] No credentials provided.
> Attempting to register without authentication
>
> I0515 23:05:38.431784 10909 sched.cpp:710] Framework registered with
> e23f2d53-22c5-40f0-918d-0d73805fdfec-0006
>
> Pi is roughly 3.145964
>
>
> 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx: Remote RPC client
> disassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.
>
> 16/05/15 23:05:52 ERROR LiveListenerBus: SparkListenerBus has already
> stopped! Dropping event
> SparkListenerExecutorRemoved(1463346352364,e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0,Remote
> RPC client disassociated. Likely due to containers exceeding thresholds, or
> network issues. Check driver logs for WARN messages.)
>
> I0515 23:05:52.380164 10810 sched.cpp:1921] Asked to stop the driver
>
> I0515 23:05:52.382272 10910 sched.cpp:1150] Stopping framework
> 'e23f2d53-22c5-40f0-918d-0d73805fdfec-0006'
>
> The Mesos sandbox gives the following messages in STDERR
>
> 16/05/15 23:05:52 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
> 1029 bytes result sent to driver
>
> 16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 8
>
> 16/05/15 23:05:52 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>
> 16/05/15 23:05:52 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8).
> 1029 bytes result sent to driver
>
> 16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 9
>
> 16/05/15 23:05:52 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
>
> 16/05/15 23:05:52 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9).
> 1029 bytes result sent to driver
>
> 16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Driver commanded a
> shutdown
>
> 16/05/15 23:05:52 INFO MemoryStore: MemoryStore cleared
>
> 16/05/15 23:05:52 INFO BlockManager: BlockManager stopped
>
> 16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
>
> 16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Remote
> daemon shut down; proceeding with flushing remote transports.
>
> 16/05/15 23:05:52 WARN CoarseGrainedExecutorBackend: An unknown
> (anabrix:45663) driver disconnected.
>
> 16/05/15 23:05:52 ERROR CoarseGrainedExecutorBackend: Driver
> 192.168.33.10:45663 disassociated! Shutting down.
>
> I0515 23:05:52.388991 11120 exec.cpp:399] Executor asked to shutdown
>
> 16/05/15 23:05:52 INFO ShutdownHookManager: Shutdown hook called
>
> 16/05/15 23:05:52 INFO ShutdownHookManager: Deleting directory
> /tmp/mesos/slaves/e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/frameworks/e23f2d53-22c5-40f0-918d-0d73805fdfec-0006/executors/0/runs/b9df4275-a597-4b8e-9a7b-45e7fb79bd93/spark-a99d0380-2d0d-4bbd-a593-49ad885e5430
>
> And the following messages in STDOUT
>
> Registered executor on xxx
>
> Starting task 0
>
> sh -c 'cd spark-1*;  ./bin/spark-class
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
> CoarseGrainedScheduler@192.168.33.10:45663 --executor-id
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 --hostname xxx --cores 1 --app-id
> e23f2d53-22c5-40f0-918d-0d73805fdfec-0006'
>
> Forked command at 11124
>
> Shutting down
>
> Sending SIGTERM to process tree at pid 11124
>
> 

Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
Hi,

I'm getting the following errors running SparkPi on a clean just compiled
and checked Mesos 0.29.0 installation with Spark 1.6.1

16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx Remote RPC client
disassociated. Likely due to containers exceeding thresholds, or network
issues. Check driver logs for WARN messages.

The Mesos examples are running fine, only the SparkPi example isn't...
I'm not sure what to do, I thought it had to do with the installation so I
installed and compiled everything again, but without any good results.

Please help,
thanks in advance,
Richard


The complete logs are

sudo ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
mesos://192.168.33.10:5050 --deploy-mode client ./lib/spark-examples* 10

16/05/15 23:05:36 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

I0515 23:05:38.393546 10915 sched.cpp:224] Version: 0.29.0

I0515 23:05:38.402220 10909 sched.cpp:328] New master detected at
master@192.168.33.10:5050

I0515 23:05:38.403033 10909 sched.cpp:338] No credentials provided.
Attempting to register without authentication

I0515 23:05:38.431784 10909 sched.cpp:710] Framework registered with
e23f2d53-22c5-40f0-918d-0d73805fdfec-0006

Pi is roughly 3.145964


16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx: Remote RPC client
disassociated. Likely due to containers exceeding thresholds, or network
issues. Check driver logs for WARN messages.

16/05/15 23:05:52 ERROR LiveListenerBus: SparkListenerBus has already
stopped! Dropping event
SparkListenerExecutorRemoved(1463346352364,e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0,Remote
RPC client disassociated. Likely due to containers exceeding thresholds, or
network issues. Check driver logs for WARN messages.)

I0515 23:05:52.380164 10810 sched.cpp:1921] Asked to stop the driver

I0515 23:05:52.382272 10910 sched.cpp:1150] Stopping framework
'e23f2d53-22c5-40f0-918d-0d73805fdfec-0006'

The Mesos sandbox gives the following messages in STDERR

16/05/15 23:05:52 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
1029 bytes result sent to driver

16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 8

16/05/15 23:05:52 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)

16/05/15 23:05:52 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8).
1029 bytes result sent to driver

16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 9

16/05/15 23:05:52 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)

16/05/15 23:05:52 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9).
1029 bytes result sent to driver

16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Driver commanded a
shutdown

16/05/15 23:05:52 INFO MemoryStore: MemoryStore cleared

16/05/15 23:05:52 INFO BlockManager: BlockManager stopped

16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.

16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.

16/05/15 23:05:52 WARN CoarseGrainedExecutorBackend: An unknown
(anabrix:45663) driver disconnected.

16/05/15 23:05:52 ERROR CoarseGrainedExecutorBackend: Driver
192.168.33.10:45663 disassociated! Shutting down.

I0515 23:05:52.388991 11120 exec.cpp:399] Executor asked to shutdown

16/05/15 23:05:52 INFO ShutdownHookManager: Shutdown hook called

16/05/15 23:05:52 INFO ShutdownHookManager: Deleting directory
/tmp/mesos/slaves/e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/frameworks/e23f2d53-22c5-40f0-918d-0d73805fdfec-0006/executors/0/runs/b9df4275-a597-4b8e-9a7b-45e7fb79bd93/spark-a99d0380-2d0d-4bbd-a593-49ad885e5430

And the following messages in STDOUT

Registered executor on xxx

Starting task 0

sh -c 'cd spark-1*;  ./bin/spark-class
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
CoarseGrainedScheduler@192.168.33.10:45663 --executor-id
e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 --hostname xxx --cores 1 --app-id
e23f2d53-22c5-40f0-918d-0d73805fdfec-0006'

Forked command at 11124

Shutting down

Sending SIGTERM to process tree at pid 11124

Sent SIGTERM to the following process trees:

[

-+- 11124 sh -c cd spark-1*;  ./bin/spark-class
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
CoarseGrainedScheduler@192.168.33.10:45663 --executor-id
e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 --hostname xxx --cores 1 --app-id
e23f2d53-22c5-40f0-918d-0d73805fdfec-0006

 \--- 11125
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.91-0.b14.el7_2.x86_64/jre/bin/java
-cp
/tmp/mesos/slaves/e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/frameworks/e23f2d53-22c5-40f0-918d-0d73805fdfec-0006/executors/0/runs/b9df4275-a597-4b8e-9a7b-45e7fb79bd93/spark-1.6.1-bin-hadoop2.6/conf/:/tmp/mesos/slaves/e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/frameworks/e23f2d53-22

Re: Split columns in RDD

2016-01-19 Thread Richard Siebeling
thanks Daniel, this will certainly help,
regards, Richard

On Tue, Jan 19, 2016 at 6:35 PM, Daniel Imberman 
wrote:

> edit 2: filter should be map
>
> val numColumns = separatedInputStrings.map{ case(id, (stateList,
> numStates)) => numStates}.reduce(math.max)
>
> On Tue, Jan 19, 2016 at 8:19 AM Daniel Imberman 
> wrote:
>
>> edit: Mistake in the second code example
>>
>> val numColumns = separatedInputStrings.filter{ case(id, (stateList,
>> numStates)) => numStates}.reduce(math.max)
>>
>>
>> On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> Hi Richard,
>>>
>>> If I understand the question correctly it sounds like you could probably
>>> do this using mapValues (I'm assuming that you want two pieces of
>>> information out of all rows, the states as individual items, and the number
>>> of states in the row)
>>>
>>>
>>> // input: RDD[(Int, String)], e.g. (1, "TX,NV,WY")
>>> val separatedInputStrings = input.mapValues { inputString =>
>>>   val stringList = inputString.split(",")
>>>   (stringList, stringList.size)
>>> }
>>>
>>> If you then wanted to find out how many state columns you should have in
>>> your table you could use a normal reduce (with a filter beforehand to
>>> reduce how much data you are shuffling)
>>>
>>> val numColumns = separatedInputStrings.filter(_._2).reduce(math.max)
>>>
>>> I hope this helps!
>>>
>>>
>>>
>>> On Tue, Jan 19, 2016 at 8:05 AM Richard Siebeling 
>>> wrote:
>>>
>>>> that's true and that's the way we're doing it now but then we're only
>>>> using the first row to determine the number of split columns.
>>>> It could be that in the second (or last) row there are 10 new columns
>>>> and we'd like to know that too.
>>>>
>>>> Probably a reduceby operator can be used to do that, but I'm hoping
>>>> that there is a better or another way,
>>>>
>>>> thanks,
>>>> Richard
>>>>
>>>> On Tue, Jan 19, 2016 at 4:22 PM, Sabarish Sasidharan <
>>>> sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>>> The most efficient to determine the number of columns would be to do a
>>>>> take(1) and split in the driver.
>>>>>
>>>>> Regards
>>>>> Sab
>>>>> On 19-Jan-2016 8:48 pm, "Richard Siebeling" 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> what is the most efficient way to split columns and know how many
>>>>>> columns are created.
>>>>>>
>>>>>> Here is the current RDD
>>>>>> -
>>>>>> ID   STATE
>>>>>> -
>>>>>> 1   TX, NY, FL
>>>>>> 2   CA, OH
>>>>>> -
>>>>>>
>>>>>> This is the preferred output:
>>>>>> -
>>>>>> IDSTATE_1 STATE_2  STATE_3
>>>>>> -
>>>>>> 1 TX  NY  FL
>>>>>> 2 CA  OH
>>>>>> -
>>>>>>
>>>>>> With a separate list containing the new columns STATE_1, STATE_2, STATE_3
>>>>>>
>>>>>>
>>>>>> It looks like the following output is feasible using a ReduceBy
>>>>>> operator
>>>>>> -
>>>>>> IDSTATE_1 STATE_2  STATE_3   NEW_COLUMNS
>>>>>> -
>>>>>> 1 TXNY   FLSTATE_1,
>>>>>> STATE_2, STATE_3
>>>>>> 2 CAOH STATE_1,
>>>>>> STATE_2
>>>>>> -
>>>>>>
>>>>>> Then in the reduce step, the distinct new columns can be calculated.
>>>>>> Is it possible to get the second output where next to the RDD the
>>>>>> new_columns are saved somewhere?
>>>>>> Or is it required to use the second approach?
>>>>>>
>>>>>> thanks in advance,
>>>>>> Richard
>>>>>>
>>>>>>
>>>>


Re: Split columns in RDD

2016-01-19 Thread Richard Siebeling
that's true and that's the way we're doing it now but then we're only using
the first row to determine the number of split columns.
It could be that in the second (or last) row there are 10 new columns and
we'd like to know that too.

Probably a reduceby operator can be used to do that, but I'm hoping that
there is a better or another way,

thanks,
Richard
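
For completeness, a small sketch of determining the number of columns over
all rows (not just the first) with a cheap extra pass over the cached split
result; `input` is the RDD[(Int, String)] from the example below:

val split = input.mapValues(_.split(",").map(_.trim)).cache()
val maxColumns = split.map { case (_, states) => states.length }.max()
val newColumns = (1 to maxColumns).map(i => s"STATE_$i")
// e.g. newColumns = Vector(STATE_1, STATE_2, STATE_3)

That is still two jobs, but the second one only scans the already cached
split values.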

On Tue, Jan 19, 2016 at 4:22 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> The most efficient to determine the number of columns would be to do a
> take(1) and split in the driver.
>
> Regards
> Sab
> On 19-Jan-2016 8:48 pm, "Richard Siebeling"  wrote:
>
>> Hi,
>>
>> what is the most efficient way to split columns and know how many columns
>> are created.
>>
>> Here is the current RDD
>> -
>> ID   STATE
>> -
>> 1   TX, NY, FL
>> 2   CA, OH
>> -
>>
>> This is the preferred output:
>> -
>> IDSTATE_1 STATE_2  STATE_3
>> -
>> 1 TX  NY  FL
>> 2 CA  OH
>> -
>>
>> With a separate list containing the new columns STATE_1, STATE_2, STATE_3
>>
>>
>> It looks like the following output is feasible using a ReduceBy operator
>> -
>> IDSTATE_1 STATE_2  STATE_3   NEW_COLUMNS
>> -
>> 1 TXNY   FLSTATE_1, STATE_2,
>> STATE_3
>> 2 CAOH STATE_1, STATE_2
>> -
>>
>> Then in the reduce step, the distinct new columns can be calculated.
>> Is it possible to get the second output where next to the RDD the
>> new_columns are saved somewhere?
>> Or is it required to use the second approach?
>>
>> thanks in advance,
>> Richard
>>
>>


Split columns in RDD

2016-01-19 Thread Richard Siebeling
Hi,

what is the most efficient way to split columns and know how many columns
are created.

Here is the current RDD
-
ID   STATE
-
1   TX, NY, FL
2   CA, OH
-

This is the preferred output:
-
IDSTATE_1 STATE_2  STATE_3
-
1 TX  NY  FL
2 CA  OH
-

With a separate list containing the new columns STATE_1, STATE_2, STATE_3


It looks like the following output is feasible using a ReduceBy operator
-
IDSTATE_1 STATE_2  STATE_3   NEW_COLUMNS
-
1 TXNY   FLSTATE_1, STATE_2,
STATE_3
2 CAOH STATE_1, STATE_2
-

Then in the reduce step, the distinct new columns can be calculated.
Is it possible to get the second output where next to the RDD the
new_columns are saved somewhere?
Or is it required to use the second approach?

thanks in advance,
Richard


Stacking transformations and using intermediate results in the next transformation

2016-01-15 Thread Richard Siebeling
Hi,

we're stacking multiple RDD operations on each other, for example as a
source we have a RDD[List[String]] like

["a", "b, c", "d"]
["a", "d, a", "d"]

In the first step we split the second column into two columns, in the next
step we filter the data on column 3 = "c", and in the final step we're doing
something else. The point is that it needs to be flexible (the user adds
custom transformations and they are stacked on top of each other like the
example above; which transformations are added by the user is therefore not
known upfront).

The transformations themselves are not a problem, but we want to keep track
of the added columns, the dropped columns and the updated columns. In the
example above, the second column is dropped and two new columns are added.

The intermediate result here will be

["a", "b", "c", "d"]
["a", "d", "a", "d"]

And the final result will be

["a", "b", "c", "d"]

What I would like to know after each transformation is which columns are
added, which columns are dropped and which ones are updated. This is
information that's needed to execute the next transformation.

I was thinking of two possible scenario's:

1. capture the metadata and store it in the RDD, effectively creating an
RDD[(List[String], List[Column], List[Column], List[Column])] object, where
the last three List[Column] contain the new, dropped or updated columns.
This will result in an RDD with a lot of extra information on each row.
That information is not needed on each row but rather once for the
whole split transformation

2. use accumulators to store the new, updated and dropped columns. But I
don't think this is feasible

Are there any better scenarios, or how could I accomplish such a scenario?
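
A minimal sketch of the first scenario, but carrying the metadata next to the
RDD (once per transformation) instead of inside every row; all names here are
hypothetical:

import org.apache.spark.rdd.RDD

case class ColumnDelta(added: Seq[String], dropped: Seq[String], updated: Seq[String])
case class TrackedData(rdd: RDD[List[String]], columns: Seq[String], delta: ColumnDelta)

// A filter step: the rows change, the column set does not.
def filterStep(in: TrackedData, column: String, value: String): TrackedData = {
  val idx = in.columns.indexOf(column)
  TrackedData(
    in.rdd.filter(row => row(idx) == value),
    in.columns,
    ColumnDelta(Nil, Nil, Nil))
}

Each transformation then returns both the new RDD and the delta it produced,
so the next transformation in the stack can consult the updated column list
without carrying it in every row.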

thanks in advance,
Richard


Re: ROSE: Spark + R on the JVM.

2016-01-13 Thread Richard Siebeling
Hi David,

the use case is that we're building a data processing system with an
intuitive user interface where Spark is used as the data processing
framework.
We would like to provide an HTML user interface to R where the user types or
copy-pastes his R code; the system should then send this R code (using
ROSE) to R, process it and give the results back to the user. The RDD would
be used so that the data can be further processed by the system, but we
would also like to be able to show the messages printed to STDOUT
and the images (plots) that are generated by R. The plots seem to be
available in the OpenCPU API, see below

[image: Inline image 1]

So the case is not that we're trying to process millions of images, but
rather that we would like to show the user the plots (like a regression
plot) generated in R. There could be several plots
generated by the code, but certainly not thousands or even hundreds, only a
few.

I hope this is possible using ROSE because it seems a really good
fit,
thanks in advance,
Richard

On Wed, Jan 13, 2016 at 3:39 AM, David Russell <
themarchoffo...@protonmail.com> wrote:

> Hi Richard,
>
> > Would it be possible to access the session API from within ROSE,
> > to get for example the images that are generated by R / openCPU
>
> Technically it would be possible although there would be some potentially
> significant runtime costs per task in doing so, primarily those related to
> extracting image data from the R session, serializing and then moving that
> data across the cluster for each and every image.
>
> From a design perspective ROSE was intended to be used within Spark scale
> applications where R object data was seen as the primary task output. An
> output in a format that could be rapidly serialized and easily processed.
> Are there real world use cases where Spark scale applications capable of
> generating 10k, 100k, or even millions of image files would actually
> need to capture and store images? If so, how practically speaking, would
> these images ever be used? I'm just not sure. Maybe you could describe your
> own use case to provide some insights?
>
> > and the logging to stdout that is logged by R?
>
> If you are referring to the R console output (generated within the R
> session during the execution of an OCPUTask) then this data could certainly
> (optionally) be captured and returned on an OCPUResult. Again, can you
> provide any details for how you might use this console output in a real
> world application?
>
> As an aside, for simple standalone Spark applications that will only ever
> run on a single host (no cluster) you could consider using an alternative
> library called *fluent-r*. This library is also available under my GitHub
> repo, see here . The fluent-r
> library already has support for the retrieval of R objects, R console
> output and R graphics device image/plots. However it is not as lightweight
> as ROSE and it not designed to work in a clustered environment. ROSE on the
> other hand is designed for scale.
>
> David
>
> "All that is gold does not glitter, Not all those who wander are lost."
>
>
>  Original Message 
> Subject: Re: ROSE: Spark + R on the JVM.
> Local Time: January 12 2016 6:56 pm
> UTC Time: January 12 2016 11:56 pm
> From: rsiebel...@gmail.com
> To: m...@vijaykiran.com
> CC: cjno...@gmail.com,themarchoffo...@protonmail.com,user@spark.apache.org
> ,d...@spark.apache.org
>
> Hi,
>
> this looks great and seems to be very usable.
> Would it be possible to access the session API from within ROSE, to get,
> for example, the images that are generated by R / OpenCPU and the output
> that R logs to stdout?
>
> thanks in advance,
> Richard
>
> On Tue, Jan 12, 2016 at 10:16 PM, Vijay Kiran  wrote:
>
>> I think it would be this:
>> https://github.com/onetapbeyond/opencpu-spark-executor
>>
>> > On 12 Jan 2016, at 18:32, Corey Nolet  wrote:
>> >
>> > David,
>> >
>> > Thank you very much for announcing this! It looks like it could be very
>> useful. Would you mind providing a link to the github?
>> >
>> > On Tue, Jan 12, 2016 at 10:03 AM, David 
>> wrote:
>> > Hi all,
>> >
>> > I'd like to share news of the recent release of a new Spark package,
>> ROSE.
>> >
>> > ROSE is a Scala library offering access to the full scientific
>> computing power of the R programming language to Apache Spark batch and
>> streaming applications on the JVM. Where Apache SparkR lets data scientists
>> use Spark from R, ROSE is designed to let Scala and Java developers use R
>> from Spark.
>> >
>> > The project is available and documented on GitHub and I would encourage
>> you to take a look. Any feedback, questions etc very welcome.
>> >
>> > David
>> >
>> > "All that is gold does not glitter, Not all those who wander are lost."
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Richard Siebeling
Hi,

this looks great and seems to be very usable.
Would it be possible to access the session API from within ROSE, to get, for
example, the images that are generated by R / OpenCPU and the output that R
logs to stdout?

thanks in advance,
Richard

On Tue, Jan 12, 2016 at 10:16 PM, Vijay Kiran  wrote:

> I think it would be this:
> https://github.com/onetapbeyond/opencpu-spark-executor
>
> > On 12 Jan 2016, at 18:32, Corey Nolet  wrote:
> >
> > David,
> >
> > Thank you very much for announcing this! It looks like it could be very
> useful. Would you mind providing a link to the github?
> >
> > On Tue, Jan 12, 2016 at 10:03 AM, David 
> wrote:
> > Hi all,
> >
> > I'd like to share news of the recent release of a new Spark package,
> ROSE.
> >
> > ROSE is a Scala library offering access to the full scientific computing
> power of the R programming language to Apache Spark batch and streaming
> applications on the JVM. Where Apache SparkR lets data scientists use Spark
> from R, ROSE is designed to let Scala and Java developers use R from Spark.
> >
> > The project is available and documented on GitHub and I would encourage
> you to take a look. Any feedback, questions etc very welcome.
> >
> > David
> >
> > "All that is gold does not glitter, Not all those who wander are lost."
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: combining operations elegantly

2014-03-24 Thread Richard Siebeling
Hi guys,

thanks for the information, I'll give it a try with Algebird,
thanks again,
Richard

@Patrick, thanks for the release calendar


On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell wrote:

> Hey All,
>
> I think the old thread is here:
> https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J
>
> The method proposed in that thread is to create a utility class for
> doing single-pass aggregations. Using Algebird is a pretty good way to
> do this and is a bit more flexible since you don't need to create a
> new utility each time you want to do this.
>
> In Spark 1.0 and later you will be able to do this more elegantly with
> the schema support:
> myRDD.groupBy('user).select(Sum('clicks) as 'clicks,
> Average('duration) as 'duration)
>
> and it will use a single pass automatically... but that's not quite
> released yet :)
>
> - Patrick
>
>
>
>
> On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers  wrote:
> > i currently typically do something like this:
> >
> > scala> val rdd = sc.parallelize(1 to 10)
> > scala> import com.twitter.algebird.Operators._
> > scala> import com.twitter.algebird.{Max, Min}
> > scala> rdd.map{ x => (
> >  |   1L,
> >  |   Min(x),
> >  |   Max(x),
> >  |   x
> >  | )}.reduce(_ + _)
> > res0: (Long, com.twitter.algebird.Min[Int],
> com.twitter.algebird.Max[Int],
> > Int) = (10,Min(1),Max(10),55)
> >
> > however for this you need twitter algebird dependency. without that you
> have
> > to code the reduce function on the tuples yourself...
> >
> > another example with 2 columns, where i do conditional count for first
> > column, and simple sum for second:
> > scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
> >  |   if (x > 5) 1 else 0,
> >  |   y
> >  | )}.reduce(_ + _)
> > res3: (Int, Int) = (5,155)
> >
> >
> >
> > On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling  >
> > wrote:
> >>
> >> Hi Koert, Patrick,
> >>
> >> do you already have an elegant solution to combine multiple operations
> on
> >> a single RDD?
> >> Say for example that I want to do a sum over one column, a count and an
> >> average over another column,
> >>
> >> thanks in advance,
> >> Richard
> >>
> >>
> >> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <
> rsiebel...@gmail.com>
> >> wrote:
> >>>
> >>> Patrick, Koert,
> >>>
> >>> I'm also very interested in these examples, could you please post them
> if
> >>> you find them?
> >>> thanks in advance,
> >>> Richard
> >>>
> >>>
> >>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers 
> wrote:
> >>>>
> >>>> not that long ago there was a nice example on here about how to
> combine
> >>>> multiple operations on a single RDD. so basically if you want to do a
> >>>> count() and something else, how to roll them into a single job. i
> think
> >>>> patrick wendell gave the examples.
> >>>>
> >>>> i cant find them anymore patrick can you please repost? thanks!
> >>>
> >>>
> >>
> >
>


Re: combining operations elegantly

2014-03-23 Thread Richard Siebeling
Hi Koert, Patrick,

do you already have an elegant solution to combine multiple operations on a
single RDD?
Say for example that I want to do a sum over one column, a count and an
average over another column,

thanks in advance,
Richard
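
Without Algebird, one way to do it in a single pass is a plain aggregate with
a tuple accumulator; a sketch, assuming an RDD of (clicks, duration) pairs:

// (sum of clicks, row count, sum of durations) in one pass over the data
val (sumClicks, count, sumDuration) = rdd.aggregate((0L, 0L, 0.0))(
  (acc, row) => (acc._1 + row._1, acc._2 + 1L, acc._3 + row._2),
  (a, b)     => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
val avgDuration = sumDuration / count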


On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling wrote:

> Patrick, Koert,
>
> I'm also very interested in these examples, could you please post them if
> you find them?
> thanks in advance,
> Richard
>
>
> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers  wrote:
>
>> not that long ago there was a nice example on here about how to combine
>> multiple operations on a single RDD. so basically if you want to do a
>> count() and something else, how to roll them into a single job. i think
>> patrick wendell gave the examples.
>>
>> i cant find them anymore patrick can you please repost? thanks!
>>
>
>


Re: combining operations elegantly

2014-03-17 Thread Richard Siebeling
Patrick, Koert,

I'm also very interested in these examples, could you please post them if
you find them?
thanks in advance,
Richard


On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers  wrote:

> not that long ago there was a nice example on here about how to combine
> multiple operations on a single RDD. so basically if you want to do a
> count() and something else, how to roll them into a single job. i think
> patrick wendell gave the examples.
>
> i cant find them anymore patrick can you please repost? thanks!
>