How to change the values in Array of Bytes

2014-09-06 Thread Deep Pradhan
Hi,
I have an array of bytes, and I have filled the array with 0 in all
positions.


*var Array = Array.fill[Byte](10)(0)*
Now, if certain conditions are satisfied, I want to change some elements of
the array to 1 instead of 0. If I run,



*if (Array.apply(index)==0) Array.apply(index) = 1*
it returns an error.

But if I assign *Array.apply(index)* to a variable and do the same thing,
it works. I do not want to assign this to variables because I would end
up creating a lot of variables.

Can anyone tell me a method to do this?

Thank You


Re: Support R in Spark

2014-09-06 Thread oppokui
Cool! That is very good news. Can’t wait for it.

Kui 

> On Sep 5, 2014, at 1:58 AM, Shivaram Venkataraman 
>  wrote:
> 
> Thanks Kui. SparkR is a pretty young project, but there are a bunch of
> things we are working on. One of the main features is to expose a data
> frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
> be integrating this with Spark's MLLib.  At a high-level this will
> allow R users to use a familiar API but make use of MLLib's efficient
> distributed implementation. This is the same strategy used in Python
> as well.
> 
> Also we do hope to merge SparkR with mainline Spark -- we have a few
> features to complete before that and plan to shoot for integration by
> Spark 1.3.
> 
> Thanks
> Shivaram
> 
> On Wed, Sep 3, 2014 at 9:24 PM, oppokui  wrote:
>> Thanks, Shivaram.
>> 
>> No specific use case yet. We try to use R in our project as data scientest
>> are all knowing R. We had a concern that how R handles the mass data. Spark
>> does a better work on big data area, and Spark ML is focusing on predictive
>> analysis area. Then we are thinking whether we can merge R and Spark
>> together. We tried SparkR and it is pretty easy to use. But we didn’t see
>> any feedback on this package in industry. It will be better if Spark team
>> has R support just like scala/Java/Python.
>> 
>> Another question is that MLlib will re-implement all famous data mining
>> algorithms in Spark, then what is the purpose of using R?
>> 
>> There is another technique for us H2O which support R natively. H2O is more
>> friendly to data scientist. I saw H2O can also work on Spark (Sparkling
>> Water).  It is better than using SparkR?
>> 
>> Thanks and Regards.
>> 
>> Kui
>> 
>> 
>> On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
>>  wrote:
>> 
>> Hi
>> 
>> Do you have a specific use-case where SparkR doesn't work well ? We'd love
>> to hear more about use-cases and features that can be improved with SparkR.
>> 
>> Thanks
>> Shivaram
>> 
>> 
>> On Wed, Sep 3, 2014 at 3:19 AM, oppokui  wrote:
>>> 
>>> Does spark ML team have plan to support R script natively? There is a
>>> SparkR project, but not from spark team. Spark ML used netlib-java to talk
>>> with native fortran routines or use NumPy, why not try to use R in some
>>> sense.
>>> 
>>> R had lot of useful packages. If spark ML team can include R support, it
>>> will be a very powerful.
>>> 
>>> Any comment?
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>> 
>> 
>> 





Re: error: type mismatch while Union

2014-09-06 Thread Dhimant
I am using Spark version 1.0.2




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13618.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to change the values in Array of Bytes

2014-09-06 Thread Aaron Davidson
More of a Scala question than Spark, but "apply" here can be written with
just parentheses like this:

val array = Array.fill[Byte](10)(0)
if (array(index) == 0) {
  array(index) = 1
}

The second instance, "array(index) = 1", is actually not calling apply,
but "update". It's a Scala-ism that's usually invisible. The above is
equivalent to:

if (array.apply(index) == 0) {
  array.update(index, 1)
}



On Sat, Sep 6, 2014 at 2:09 AM, Deep Pradhan 
wrote:

> Hi,
> I have an array of bytes and I have filled the array with 0 in all the
> postitions.
>
>
> *var Array = Array.fill[Byte](10)(0)*
> Now, if certain conditions are satisfied, I want to change some elements
> of the array to 1 instead of 0. If I run,
>
>
>
> *if (Array.apply(index)==0) Array.apply(index) = 1*
> it returns me an error.
>
> But if I assign *Array.apply(index) *to a variable and do the same thing
> then it works. I do not want to assign this to variables because if I do
> this, I would be creating a lot of variables.
>
> Can anyone tell me a method to do this?
>
> Thank You
>
>
>


Re: error: type mismatch while Union

2014-09-06 Thread Aaron Davidson
Are you doing this from the spark-shell? You're probably running into
https://issues.apache.org/jira/browse/SPARK-1199 which should be fixed in
1.1.


On Sat, Sep 6, 2014 at 3:03 AM, Dhimant  wrote:

> I am using Spark version 1.0.2
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13618.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: question on replicate() in blockManager.scala

2014-09-06 Thread Aaron Davidson
Looks like that's BlockManagerWorker.syncPutBlock(), which is in an if
check, perhaps obscuring its existence.


On Fri, Sep 5, 2014 at 2:19 AM, rapelly kartheek 
wrote:

> Hi,
>
> var cachedPeers: Seq[BlockManagerId] = null
>   private def replicate(blockId: String, data: ByteBuffer, level:
> StorageLevel) {
> val tLevel = StorageLevel(level.useDisk, level.useMemory,
> level.deserialized, 1)
> if (cachedPeers == null) {
>   cachedPeers = master.getPeers(blockManagerId, level.replication - 1)
> }
> for (peer: BlockManagerId <- cachedPeers) {
>   val start = System.nanoTime
>   data.rewind()
>   logDebug("Try to replicate BlockId " + blockId + " once; The size of
> the data is "
> + data.limit() + " Bytes. To node: " + peer)
>   if (!BlockManagerWorker.syncPutBlock(PutBlock(blockId, data,
> tLevel),
> new ConnectionManagerId(peer.host, peer.port))) {
> logError("Failed to call syncPutBlock to " + peer)
>   }
>   logDebug("Replicated BlockId " + blockId + " once used " +
> (System.nanoTime - start) / 1e6 + " s; The size of the data is " +
> data.limit() + " bytes.")
> }
>
>
> I get the flow of this code. But I don't find any method being called for
> actually writing the data into the set of peers chosen for replication.
>
> Where exactly is the replication happening?
>
> Thank you!!
> -Karthik
>


Re: Task not serializable

2014-09-06 Thread Sean Owen
I disagree that the generally right change is to try to make the
classes serializable. Usually, classes that are not serializable are
not supposed to be serialized. You're using them in a way that's
causing them to be serialized, and that's probably not desired.

For example, this is wrong:

val foo: SomeUnserializableManagerClass = ...
rdd.map(d => foo.bar(d))

This is right:

rdd.map { d =>
  val foo: SomeUnserializableManagerClass = ...
  foo.bar(d)
}

In the first instance, you create the object on the driver and try to
serialize and copy it to workers. In the second, you're creating
SomeUnserializableManagerClass in the function and therefore on the
worker.

mapPartitions is better if this creation is expensive.
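
For instance, a minimal sketch of that fix (SomeUnserializableManagerClass is
just the placeholder name from the example above, and the "..." stands for
however the object is constructed):

rdd.mapPartitions { iter =>
  // Construct the non-serializable helper once per partition, on the worker,
  // instead of once per element or once on the driver.
  val foo: SomeUnserializableManagerClass = ...
  iter.map(d => foo.bar(d))
}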

On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
 wrote:
> Hi,
>
> I'm trying to migrate a map-reduce program to work with spark. I migrated
> the program from Java to Scala. The map-reduce program basically loads a
> HDFS file and for each line in the file it applies several transformation
> functions available in various external libraries.
>
> When I execute this over spark, it is throwing me "Task not serializable"
> exceptions for each and every class being used from these external
> libraries. I included serialization to a few classes which are in my scope,
> but there are several other classes which are out of my scope, like
> org.apache.hadoop.io.Text.
>
> How to overcome these exceptions?
>
> ~Sarath.




Re: Getting the type of an RDD in spark AND pyspark

2014-09-06 Thread Aaron Davidson
Pretty easy to do in Scala:

rdd.elementClassTag.runtimeClass

You can access this method from Python as well by using the internal _jrdd.
It would look something like this (warning, I have not tested it):
rdd._jrdd.classTag().runtimeClass()

(The method name is "classTag" for JavaRDDLike, and "elementClassTag" for
Scala's RDD.)
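
For example, a quick sketch on the Scala side (sc is the usual SparkContext):

val nums = sc.parallelize(Array(1, 2, 3))
// For an RDD[Int] this returns the primitive class, i.e. int
val elemClass: Class[_] = nums.elementClassTag.runtimeClass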


On Thu, Sep 4, 2014 at 1:32 PM, esamanas  wrote:

> Hi,
>
> I'm new to spark and scala, so apologies if this is obvious.
>
> Every RDD appears to be typed, which I can see by seeing the output in the
> spark-shell when I execute 'take':
>
> scala> val t = sc.parallelize(Array(1,2,3))
> t: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize
> at :12
>
> scala> t.take(3)
> res4: Array[Int] = Array(1, 2, 3)
>
>
> scala> val u = sc.parallelize(Array(1,Array(2,2,2,2,2),3))
> u: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[3] at parallelize
> at :12
>
> scala> u.take(3)
> res5: Array[Any] = Array(1, Array(2, 2, 2, 2, 2), 3)
>
> Array type stays the same even if only one type returned.
> scala> u.take(1)
> res6: Array[Any] = Array(1)
>
>
> Is there some way to just get the name of the type of the entire RDD from
> some function call?  Also, I would really like this same functionality in
> pyspark, so I'm wondering if that exists on that side, since clearly the
> underlying RDD is typed (I'd be fine with either the Scala or Python type
> name).
>
> Thank you,
>
> Evan
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-type-of-an-RDD-in-spark-AND-pyspark-tp13498.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


unsubscribe

2014-09-06 Thread Murali Raju



Re: unsubscribe

2014-09-06 Thread Derek Schoettle

Unsubscribe

> On Sep 6, 2014, at 7:48 AM, "Murali Raju" 
wrote:
>
>

Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
Hi Kui,

DDF (open sourced) also aims to do something similar, adding RDBMS idioms,
and is already implemented on top of Spark.

One philosophy is that the DDF API aggressively hides the notion of
parallel datasets, exposing only (mutable) tables to users, on which they
can apply R and other familiar data mining/machine learning idioms, without
having to know about the distributed representation underneath. Now, you
can get to the underlying RDDs if you want to, simply by asking for it.

This was launched at the July Spark Summit. See
http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
.

Sent while mobile. Please excuse typos etc.
On Sep 4, 2014 1:59 PM, "Shivaram Venkataraman" 
wrote:

> Thanks Kui. SparkR is a pretty young project, but there are a bunch of
> things we are working on. One of the main features is to expose a data
> frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
> be integrating this with Spark's MLLib.  At a high-level this will
> allow R users to use a familiar API but make use of MLLib's efficient
> distributed implementation. This is the same strategy used in Python
> as well.
>
> Also we do hope to merge SparkR with mainline Spark -- we have a few
> features to complete before that and plan to shoot for integration by
> Spark 1.3.
>
> Thanks
> Shivaram
>
> On Wed, Sep 3, 2014 at 9:24 PM, oppokui  wrote:
> > Thanks, Shivaram.
> >
> > No specific use case yet. We try to use R in our project as data
> scientest
> > are all knowing R. We had a concern that how R handles the mass data.
> Spark
> > does a better work on big data area, and Spark ML is focusing on
> predictive
> > analysis area. Then we are thinking whether we can merge R and Spark
> > together. We tried SparkR and it is pretty easy to use. But we didn’t see
> > any feedback on this package in industry. It will be better if Spark team
> > has R support just like scala/Java/Python.
> >
> > Another question is that MLlib will re-implement all famous data mining
> > algorithms in Spark, then what is the purpose of using R?
> >
> > There is another technique for us H2O which support R natively. H2O is
> more
> > friendly to data scientist. I saw H2O can also work on Spark (Sparkling
> > Water).  It is better than using SparkR?
> >
> > Thanks and Regards.
> >
> > Kui
> >
> >
> > On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
> >  wrote:
> >
> > Hi
> >
> > Do you have a specific use-case where SparkR doesn't work well ? We'd
> love
> > to hear more about use-cases and features that can be improved with
> SparkR.
> >
> > Thanks
> > Shivaram
> >
> >
> > On Wed, Sep 3, 2014 at 3:19 AM, oppokui  wrote:
> >>
> >> Does spark ML team have plan to support R script natively? There is a
> >> SparkR project, but not from spark team. Spark ML used netlib-java to
> talk
> >> with native fortran routines or use NumPy, why not try to use R in some
> >> sense.
> >>
> >> R had lot of useful packages. If spark ML team can include R support, it
> >> will be a very powerful.
> >>
> >> Any comment?
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Task not serializable

2014-09-06 Thread Sarath Chandra
Thanks Alok, Sean.

As suggested by Sean, I tried a sample program. I wrote a function that
referenced a class from a third-party library that is not serializable
and passed it to my map function. On executing it I got the same
exception.

Then I modified the program, removed the function, and wrote its contents as
an anonymous function inside the map function. This time the execution succeeded.

I understood Sean's explanation, but I would appreciate references to a more
detailed explanation and examples of writing efficient Spark programs that
avoid such pitfalls.

~Sarath
 On 06-Sep-2014 4:32 pm, "Sean Owen"  wrote:

> I disagree that the generally right change is to try to make the
> classes serializable. Usually, classes that are not serializable are
> not supposed to be serialized. You're using them in a way that's
> causing them to be serialized, and that's probably not desired.
>
> For example, this is wrong:
>
> val foo: SomeUnserializableManagerClass = ...
> rdd.map(d => foo.bar(d))
>
> This is right:
>
> rdd.map { d =>
>   val foo: SomeUnserializableManagerClass = ...
>   foo.bar(d)
> }
>
> In the first instance, you create the object on the driver and try to
> serialize and copy it to workers. In the second, you're creating
> SomeUnserializableManagerClass in the function and therefore on the
> worker.
>
> mapPartitions is better if this creation is expensive.
>
> On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
>  wrote:
> > Hi,
> >
> > I'm trying to migrate a map-reduce program to work with spark. I migrated
> > the program from Java to Scala. The map-reduce program basically loads a
> > HDFS file and for each line in the file it applies several transformation
> > functions available in various external libraries.
> >
> > When I execute this over spark, it is throwing me "Task not serializable"
> > exceptions for each and every class being used from these external
> > libraries. I included serialization to a few classes which are in my scope,
> > but there are several other classes which are out of my scope, like
> > org.apache.hadoop.io.Text.
> >
> > How to overcome these exceptions?
> >
> > ~Sarath.
>


Re: How spark parallelize maps Slices to tasks/executors/workers

2014-09-06 Thread Matthew Farrellee

On 09/04/2014 09:55 PM, Mozumder, Monir wrote:

I have this 2-node cluster setup, where each node has 4 cores.

 MASTER

 (Worker-on-master)  (Worker-on-node1)

(slaves(master,node1))

SPARK_WORKER_INSTANCES=1

I am trying to understand Spark's parallelize behavior. The sparkPi
example has this code:

 val slices = 8

 val n = 10 * slices

 val count = spark.parallelize(1 to n, slices).map { i =>

   val x = random * 2 - 1

   val y = random * 2 - 1

   if (x*x + y*y < 1) 1 else 0

 }.reduce(_ + _)

As per the documentation: "Spark will run one task for each slice of the
cluster. Typically you want 2-4 slices for each CPU in your cluster". I
set slices to 8, which means the working set will be divided among 8
tasks on the cluster; in turn each worker node gets 4 tasks (1:1 per core).

Questions:

i)  Where can I see task-level details? Inside executors I don't see a
task breakdown, so I cannot see the effect of slices in the UI.


under http://localhost:4040/stages/ you can drill into individual stages 
to see task details




ii) How can I programmatically find the working set size for the map
function above? I assume it is n/slices (10 above).


it'll be roughly n/slices. you can mapPartitions() and check their length
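
for example, a rough sketch reusing the n and slices values from the snippet
above:

val sizes = spark.parallelize(1 to n, slices)
  .mapPartitions(iter => Iterator(iter.size))   // one count per partition
  .collect()
// sizes(i) is the number of elements task i received, roughly n / slices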



iii) Are the multiple tasks run by an executor executed sequentially or
in parallel in multiple threads?


parallel. have a look at 
https://spark.apache.org/docs/latest/cluster-overview.html




iv) Reasoning behind 2-4 slices per CPU.


typically things like "2-4 slices per CPU" are general rules of thumb 
because tasks are more io bound than not. depending on your workload 
this might change. it's probably one of the last things you'll want to 
optimize, first being the transformation ordering in your dag.




v) I assume ideally we should tune SPARK_WORKER_INSTANCES to
correspond to number of

Bests,

-Monir



best,


matt




Spark SQL check if query is completed (pyspark)

2014-09-06 Thread jamborta
Hi,

I am using Spark SQL to run some administrative queries and joins (e.g.
create table, insert overwrite, etc), where the query does not return any
data. I noticed if the query fails it prints some error message on the
console, but does not actually throw an exception (this is spark 1.0.2). 

Is there any way to get these errors from the returned object?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-check-if-query-is-completed-pyspark-tp13630.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: unsubscribe

2014-09-06 Thread Nicholas Chammas
To unsubscribe send an email to user-unsubscr...@spark.apache.org

Links to sub/unsub are here: https://spark.apache.org/community.html


On Sat, Sep 6, 2014 at 7:52 AM, Derek Schoettle  wrote:

> Unsubscribe
>
> > On Sep 6, 2014, at 7:48 AM, "Murali Raju" 
> wrote:
> >
> >
>


Re: Support R in Spark

2014-09-06 Thread oppokui
Thanks, Christopher. I saw it before, it is amazing. Last time I tried to
download it from adatao, but got no response after filling in the form. How can I
download it or its source code? What is the license?

Kui


> On Sep 6, 2014, at 8:08 PM, Christopher Nguyen  wrote:
> 
> Hi Kui,
> 
> DDF (open sourced) also aims to do something similar, adding RDBMS idioms, 
> and is already implemented on top of Spark.
> 
> One philosophy is that the DDF API aggressively hides the notion of parallel 
> datasets, exposing only (mutable) tables to users, on which they can apply R 
> and other familiar data mining/machine learning idioms, without having to 
> know about the distributed representation underneath. Now, you can get to the 
> underlying RDDs if you want to, simply by asking for it.
> 
> This was launched at the July Spark Summit. See 
> http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
>  .
> 
> Sent while mobile. Please excuse typos etc.
> 
> On Sep 4, 2014 1:59 PM, "Shivaram Venkataraman"  
> wrote:
> Thanks Kui. SparkR is a pretty young project, but there are a bunch of
> things we are working on. One of the main features is to expose a data
> frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
> be integrating this with Spark's MLLib.  At a high-level this will
> allow R users to use a familiar API but make use of MLLib's efficient
> distributed implementation. This is the same strategy used in Python
> as well.
> 
> Also we do hope to merge SparkR with mainline Spark -- we have a few
> features to complete before that and plan to shoot for integration by
> Spark 1.3.
> 
> Thanks
> Shivaram
> 
> On Wed, Sep 3, 2014 at 9:24 PM, oppokui  wrote:
> > Thanks, Shivaram.
> >
> > No specific use case yet. We try to use R in our project as data scientest
> > are all knowing R. We had a concern that how R handles the mass data. Spark
> > does a better work on big data area, and Spark ML is focusing on predictive
> > analysis area. Then we are thinking whether we can merge R and Spark
> > together. We tried SparkR and it is pretty easy to use. But we didn’t see
> > any feedback on this package in industry. It will be better if Spark team
> > has R support just like scala/Java/Python.
> >
> > Another question is that MLlib will re-implement all famous data mining
> > algorithms in Spark, then what is the purpose of using R?
> >
> > There is another technique for us H2O which support R natively. H2O is more
> > friendly to data scientist. I saw H2O can also work on Spark (Sparkling
> > Water).  It is better than using SparkR?
> >
> > Thanks and Regards.
> >
> > Kui
> >
> >
> > On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
> >  wrote:
> >
> > Hi
> >
> > Do you have a specific use-case where SparkR doesn't work well ? We'd love
> > to hear more about use-cases and features that can be improved with SparkR.
> >
> > Thanks
> > Shivaram
> >
> >
> > On Wed, Sep 3, 2014 at 3:19 AM, oppokui  wrote:
> >>
> >> Does spark ML team have plan to support R script natively? There is a
> >> SparkR project, but not from spark team. Spark ML used netlib-java to talk
> >> with native fortran routines or use NumPy, why not try to use R in some
> >> sense.
> >>
> >> R had lot of useful packages. If spark ML team can include R support, it
> >> will be a very powerful.
> >>
> >> Any comment?
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
> >
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



Re: Spark SQL check if query is completed (pyspark)

2014-09-06 Thread Davies Liu
SQLContext.sql() will return a SchemaRDD; you need to call collect()
to pull the data in.
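
A minimal sketch of that (Scala API shown, but the PySpark pattern is the
same; sqlContext and query are placeholder names):

val result = sqlContext.sql(query)  // only defines the job
result.collect()                    // runs it; a failing query should surface here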

On Sat, Sep 6, 2014 at 6:02 AM, jamborta  wrote:
> Hi,
>
> I am using Spark SQL to run some administrative queries and joins (e.g.
> create table, insert overwrite, etc), where the query does not return any
> data. I noticed if the query fails it prints some error message on the
> console, but does not actually throw an exception (this is spark 1.0.2).
>
> Is there any way to get these errors from the returned object?
>
> thanks,
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-check-if-query-is-completed-pyspark-tp13630.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




Re: Getting the type of an RDD in spark AND pyspark

2014-09-06 Thread Davies Liu
But you cannot get what you expect in PySpark, because the RDD
on the Scala side is serialized, so it will always be RDD[Array[Byte]], whatever
the type of the RDD in Python is.

Davies

On Sat, Sep 6, 2014 at 4:09 AM, Aaron Davidson  wrote:
> Pretty easy to do in Scala:
>
> rdd.elementClassTag.runtimeClass
>
> You can access this method from Python as well by using the internal _jrdd.
> It would look something like this (warning, I have not tested it):
> rdd._jrdd.classTag().runtimeClass()
>
> (The method name is "classTag" for JavaRDDLike, and "elementClassTag" for
> Scala's RDD.)
>
>
> On Thu, Sep 4, 2014 at 1:32 PM, esamanas  wrote:
>>
>> Hi,
>>
>> I'm new to spark and scala, so apologies if this is obvious.
>>
>> Every RDD appears to be typed, which I can see by seeing the output in the
>> spark-shell when I execute 'take':
>>
>> scala> val t = sc.parallelize(Array(1,2,3))
>> t: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize
>> at :12
>>
>> scala> t.take(3)
>> res4: Array[Int] = Array(1, 2, 3)
>>
>>
>> scala> val u = sc.parallelize(Array(1,Array(2,2,2,2,2),3))
>> u: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[3] at parallelize
>> at :12
>>
>> scala> u.take(3)
>> res5: Array[Any] = Array(1, Array(2, 2, 2, 2, 2), 3)
>>
>> Array type stays the same even if only one type returned.
>> scala> u.take(1)
>> res6: Array[Any] = Array(1)
>>
>>
>> Is there some way to just get the name of the type of the entire RDD from
>> some function call?  Also, I would really like this same functionality in
>> pyspark, so I'm wondering if that exists on that side, since clearly the
>> underlying RDD is typed (I'd be fine with either the Scala or Python type
>> name).
>>
>> Thank you,
>>
>> Evan
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-type-of-an-RDD-in-spark-AND-pyspark-tp13498.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>




Q: About scenarios where driver execution flow may block...

2014-09-06 Thread didata

Hello friends:


I have a theory question about call blocking in a Spark driver.


Consider this (admittedly contrived =:)) snippet to illustrate this question...



x = rdd01.reduceByKey()  # or maybe some other 'shuffle-requiring action'.


b = sc.broadcast(x.take(20)) # Or any statement that requires the previous
statement to complete, cluster-wide.



y = rdd02.someAction(f(b))



Would the first or second statement above block because the second (or 
third) statement needs to wait for the previous one to complete, cluster-wide?



Maybe this isn't the best example (typed on a phone), but generally I'm
trying to understand the scenario(s) where an RDD call in the driver may
block because the graph indicates that the next statement depends on
the completion of the current one, cluster-wide (not just lazily evaluated).


Thank you. :)


Sincerely yours,
Team Dimension Data


Re: Is there any way to control the parallelism in LogisticRegression

2014-09-06 Thread DB Tsai
Yes. But you need to store the RDD as *serialized* Java objects. See the
section on storage levels at
http://spark.apache.org/docs/latest/programming-guide.html
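
A minimal sketch (trainingData is a placeholder for your cached RDD):

import org.apache.spark.storage.StorageLevel

// Cache as serialized Java objects; spark.rdd.compress then compresses
// these serialized partitions as well.
trainingData.persist(StorageLevel.MEMORY_ONLY_SER)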

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Thu, Sep 4, 2014 at 8:06 PM, Jiusheng Chen 
wrote:

> Thanks DB.
> Did you mean this?
>
> spark.rdd.compress  true
>
>
> On Thu, Sep 4, 2014 at 2:48 PM, DB Tsai  wrote:
>
>> For saving the memory, I recommend you compress the cached RDD, and it
>> will be couple times smaller than original data sets.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Wed, Sep 3, 2014 at 9:28 PM, Jiusheng Chen 
>> wrote:
>>
>>> Thanks DB and Xiangrui. Glad to know you guys are actively working on it.
>>>
>>> Another thing, did we evaluate the loss of using Float to store values?
>>> currently it is Double. Use fewer bits has the benifit of memory footprint
>>> reduction. According to Google, they even uses 16 bits (a special
>>> encoding scheme called q2.13)
>>>  in their learner
>>> without measurable loss, and can save 75% memory.
>>>
>>>
>>> On Thu, Sep 4, 2014 at 11:02 AM, DB Tsai  wrote:
>>>
 With David's help today, we were able to implement elastic net glm in
 Spark. It's surprising easy, and with just some modification in breeze's
 OWLQN code, it just works without further investigation.

 We did benchmark, and the coefficients are within 0.5% differences
 compared with R's glmnet package. I guess this is first truly distributed
 glmnet implementation.

 Still require some effort to have it in mllib; mostly api cleanup work.

 1) I'll submit a PR to breeze which implements weighted regularization
 in OWLQN.
 2) This also depends on
 https://issues.apache.org/jira/browse/SPARK-2505 which we have
 internal version requiring some cleanup for open source project.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Wed, Sep 3, 2014 at 7:34 PM, Xiangrui Meng  wrote:

> +DB & David (They implemented QWLQN on Spark today.)
> On Sep 3, 2014 7:18 PM, "Jiusheng Chen" 
> wrote:
>
>> Hi Xiangrui,
>>
>> A side-by question about MLLib.
>> It looks current LBFGS in MLLib (version 1.0.2 and even v1.1) only
>> support L2 regurization, the doc explains it: "The L1 regularization by
>> using L1Updater
>> 
>> will not work since the soft-thresholding logic in L1Updater is designed
>> for gradient descent."
>>
>> Since the algorithm comes from Breeze and I noticed Breeze actually
>> supports L1 (OWLQN
>> ). Wondering
>> if there is some special considerations that current MLLib didn't support
>> OWLQN? And any plan to add it?
>>
>> Thanks for your time!
>>
>>
>>
>> On Fri, Aug 22, 2014 at 12:56 PM, ZHENG, Xu-dong 
>> wrote:
>>
>>> Update.
>>>
>>> I just find a magic parameter *blanceSlack* in *CoalescedRDD*,
>>> which sounds could control the locality. The default value is 0.1 
>>> (smaller
>>> value means lower locality). I change it to 1.0 (full locality) and use 
>>> #3
>>> approach, then find a lot improvement (20%~40%). Although the Web UI 
>>> still
>>> shows the type of task as 'ANY' and the input is from shuffle read, but 
>>> the
>>> real performance is much better before change this parameter.
>>> [image: Inline image 1]
>>>
>>> I think the benefit includes:
>>>
>>> 1. This approach keep the physical partition size small, but make
>>> each task handle multiple partitions. So the memory requested for
>>> deserialization is reduced, which also reduce the GC time. That is 
>>> exactly
>>> what we observed in our job.
>>>
>>> 2. This approach will not hit the 2G limitation, because it not
>>> change the partition size.
>>>
>>> And I also think that, Spark may change this default value, or at
>>> least expose this parameter to users (*CoalescedRDD *is a private
>>> class, and *RDD*.*coalesce* also don't have a parameter to control
>>> that).
>>>
>>> On Wed, Aug 13, 2014 at 12:28 AM, Xiangrui Meng 
>>> wrote:
>>>
 Sorry, I missed #2. My suggestion is the same as #2. You need to
 set a
 bigger numPartitions to avoid hitting integer bound or 2G
 limitation,
 at the cost of increased shu

Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
Hi Kui, sorry about that. That link you mentioned is probably the one for
the products. We don't have one pointing from adatao.com to ddf.io; maybe
we'll add it.

As for access to the code base itself, I think the team has already created
a GitHub repo for it, and should open it up within a few weeks. There's
some debate about whether to put out the implementation with Shark
dependencies now, or the SparkSQL one, which has somewhat limited functionality
and is not as well tested.

I'll check and ping when this is opened up.

The license is Apache.

Sent while mobile. Please excuse typos etc.
On Sep 6, 2014 1:39 PM, "oppokui"  wrote:

> Thanks, Christopher. I saw it before, it is amazing. Last time I try to
> download it from adatao, but no response after filling the table. How can I
> download it or its source code? What is the license?
>
> Kui
>
>
> On Sep 6, 2014, at 8:08 PM, Christopher Nguyen  wrote:
>
> Hi Kui,
>
> DDF (open sourced) also aims to do something similar, adding RDBMS idioms,
> and is already implemented on top of Spark.
>
> One philosophy is that the DDF API aggressively hides the notion of
> parallel datasets, exposing only (mutable) tables to users, on which they
> can apply R and other familiar data mining/machine learning idioms, without
> having to know about the distributed representation underneath. Now, you
> can get to the underlying RDDs if you want to, simply by asking for it.
>
> This was launched at the July Spark Summit. See
> http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
> .
>
> Sent while mobile. Please excuse typos etc.
> On Sep 4, 2014 1:59 PM, "Shivaram Venkataraman" <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Thanks Kui. SparkR is a pretty young project, but there are a bunch of
>> things we are working on. One of the main features is to expose a data
>> frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
>> be integrating this with Spark's MLLib.  At a high-level this will
>> allow R users to use a familiar API but make use of MLLib's efficient
>> distributed implementation. This is the same strategy used in Python
>> as well.
>>
>> Also we do hope to merge SparkR with mainline Spark -- we have a few
>> features to complete before that and plan to shoot for integration by
>> Spark 1.3.
>>
>> Thanks
>> Shivaram
>>
>> On Wed, Sep 3, 2014 at 9:24 PM, oppokui  wrote:
>> > Thanks, Shivaram.
>> >
>> > No specific use case yet. We try to use R in our project as data
>> scientest
>> > are all knowing R. We had a concern that how R handles the mass data.
>> Spark
>> > does a better work on big data area, and Spark ML is focusing on
>> predictive
>> > analysis area. Then we are thinking whether we can merge R and Spark
>> > together. We tried SparkR and it is pretty easy to use. But we didn’t
>> see
>> > any feedback on this package in industry. It will be better if Spark
>> team
>> > has R support just like scala/Java/Python.
>> >
>> > Another question is that MLlib will re-implement all famous data mining
>> > algorithms in Spark, then what is the purpose of using R?
>> >
>> > There is another technique for us H2O which support R natively. H2O is
>> more
>> > friendly to data scientist. I saw H2O can also work on Spark (Sparkling
>> > Water).  It is better than using SparkR?
>> >
>> > Thanks and Regards.
>> >
>> > Kui
>> >
>> >
>> > On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
>> >  wrote:
>> >
>> > Hi
>> >
>> > Do you have a specific use-case where SparkR doesn't work well ? We'd
>> love
>> > to hear more about use-cases and features that can be improved with
>> SparkR.
>> >
>> > Thanks
>> > Shivaram
>> >
>> >
>> > On Wed, Sep 3, 2014 at 3:19 AM, oppokui  wrote:
>> >>
>> >> Does spark ML team have plan to support R script natively? There is a
>> >> SparkR project, but not from spark team. Spark ML used netlib-java to
>> talk
>> >> with native fortran routines or use NumPy, why not try to use R in some
>> >> sense.
>> >>
>> >> R had lot of useful packages. If spark ML team can include R support,
>> it
>> >> will be a very powerful.
>> >>
>> >> Any comment?
>> >>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-06 Thread Victor Tso-Guillen
I ran into the same issue. What I did was use the Maven Shade plugin to shade
my version of the httpcomponents libraries into another package.


On Fri, Sep 5, 2014 at 4:33 PM, Penny Espinoza <
pesp...@societyconsulting.com> wrote:

>  Hey - I’m struggling with some dependency issues with
> org.apache.httpcomponents httpcore and httpclient when using spark-submit
> with YARN running Spark 1.0.2 on a Hadoop 2.2 cluster.  I’ve seen several
> posts about this issue, but no resolution.
>
>  The error message is this:
>
>
>  Caused by: java.lang.NoSuchMethodError:
> org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
> at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
> at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
> at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
> at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:85)
> at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:93)
> at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
> at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96)
> at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:155)
> at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:118)
> at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:102)
> at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:332)
> at com.oncue.rna.realtime.streaming.config.package$.transferManager(package.scala:76)
> at com.oncue.rna.realtime.streaming.models.S3SchemaRegistry.<init>(SchemaRegistry.scala:27)
> at com.oncue.rna.realtime.streaming.models.S3SchemaRegistry$.schemaRegistry$lzycompute(SchemaRegistry.scala:46)
> at com.oncue.rna.realtime.streaming.models.S3SchemaRegistry$.schemaRegistry(SchemaRegistry.scala:44)
> at com.oncue.rna.realtime.streaming.coders.KafkaAvroDecoder.<init>(KafkaAvroDecoder.scala:20)
> ... 17 more
>
>   The apache httpcomponents libraries include the method above as of
> version 4.2.  The Spark 1.0.2 binaries seem to include version 4.1.
>
>  I can get this to work in my driver program by adding exclusions to
> force use of 4.1, but then I get the error in tasks even when using the
> --jars option of the spark-submit command.  How can I get both the driver
> program and the individual tasks in my spark-streaming job to use the same
> version of this library so my job will run all the way through?
>
>  thanks
> p
>


Spark Streaming and database access (e.g. MySQL)

2014-09-06 Thread jchen
Hi,

Has someone tried using Spark Streaming with MySQL (or any other
database/data store)? I can write to MySQL at the beginning of the driver
application. However, when I am trying to write the result of every
streaming processing window to MySQL, it fails with the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task not
serializable: java.io.NotSerializableException:
com.mysql.jdbc.JDBC4PreparedStatement

I think it is because the statement object would have to be serializable in
order to be executed on the worker nodes. Has anyone tried similar cases?
Example code would be very helpful. My intention is to execute
INSERT/UPDATE/DELETE/SELECT statements for each sliding window.
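
The pattern I am considering (just a sketch; stream, the JDBC URL, the table,
and the (String, String) record type are all placeholders) is to open the
connection inside foreachPartition on the workers, so that no Statement
object ever needs to be serialized:

import java.sql.DriverManager

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Created on the worker, so nothing here is shipped from the driver.
    val conn = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass")
    val stmt = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
    records.foreach { case (k, v) =>
      stmt.setString(1, k)
      stmt.setString(2, v)
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
  }
}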

Thanks,
JC



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-database-access-e-g-MySQL-tp13644.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
