How to change the values in Array of Bytes

2014-09-06 Thread Deep Pradhan
Hi,
I have an array of bytes and I have filled the array with 0 in all the
positions.


*var Array = Array.fill[Byte](10)(0)*
Now, if certain conditions are satisfied, I want to change some elements of
the array to 1 instead of 0. If I run,



*if (Array.apply(index)==0) Array.apply(index) = 1*
it returns me an error.

But if I assign *Array.apply(index) *to a variable and do the same thing
then it works. I do not want to assign this to variables because if I do
this, I would be creating a lot of variables.

Can anyone tell me a method to do this?

Thank You


Re: Support R in Spark

2014-09-06 Thread oppokui
Cool! It is very good news. Can’t wait for it.

Kui 

 On Sep 5, 2014, at 1:58 AM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:
 
 Thanks Kui. SparkR is a pretty young project, but there are a bunch of
 things we are working on. One of the main features is to expose a data
 frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
 be integrating this with Spark's MLLib.  At a high-level this will
 allow R users to use a familiar API but make use of MLLib's efficient
 distributed implementation. This is the same strategy used in Python
 as well.
 
 Also we do hope to merge SparkR with mainline Spark -- we have a few
 features to complete before that and plan to shoot for integration by
 Spark 1.3.
 
 Thanks
 Shivaram
 
 On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote:
 Thanks, Shivaram.
 
 No specific use case yet. We are trying to use R in our project because our
 data scientists all know R. We had a concern about how R handles massive
 data. Spark does a better job in the big data area, and Spark ML is focused
 on predictive analytics. So we are considering whether we can merge R and
 Spark together. We tried SparkR and it is pretty easy to use, but we didn’t
 see any feedback on this package from industry. It would be better if the
 Spark team had R support just like Scala/Java/Python.
 
 Another question: if MLlib re-implements all the well-known data mining
 algorithms in Spark, then what is the purpose of using R?
 
 Another option for us is H2O, which supports R natively and is more friendly
 to data scientists. I saw H2O can also work on Spark (Sparkling Water). Is it
 better than using SparkR?
 
 Thanks and Regards.
 
 Kui
 
 
 On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
 
 Hi
 
 Do you have a specific use-case where SparkR doesn't work well ? We'd love
 to hear more about use-cases and features that can be improved with SparkR.
 
 Thanks
 Shivaram
 
 
 On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote:
 
 Does the Spark ML team have a plan to support R scripts natively? There is a
 SparkR project, but it is not from the Spark team. Spark ML uses netlib-java
 to talk to native Fortran routines, or uses NumPy, so why not try to use R in
 some sense?
 
 R has a lot of useful packages. If the Spark ML team can include R support,
 it will be very powerful.
 
 Any comment?
 
 
 
 
 





Re: error: type mismatch while Union

2014-09-06 Thread Dhimant
I am using Spark version 1.0.2




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13618.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to change the values in Array of Bytes

2014-09-06 Thread Aaron Davidson
This is more of a Scala question than a Spark one, but apply here can be
written with just parentheses, like this:

val array = Array.fill[Byte](10)(0)
if (array(index) == 0) {
  array(index) = 1
}

The second use of array(index), in array(index) = 1, is actually not calling
apply but update. It's a Scala-ism that's usually invisible. The above is
equivalent to:

if (array.apply(index) == 0) {
  array.update(index, 1)
}
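
For anyone curious how this sugar works in general, here is a small
standalone sketch (illustrative only, not from the thread) showing that x(i)
and x(i) = v desugar to apply and update on any class that defines them; the
Counter class is made up for the example:

// Hypothetical class, just to demonstrate the apply/update desugaring.
class Counter {
  private val data = scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)
  def apply(i: Int): Int = data(i)               // c(i) becomes c.apply(i)
  def update(i: Int, v: Int): Unit = data(i) = v // c(i) = v becomes c.update(i, v)
}

val c = new Counter
c(3) = 42       // compiles to c.update(3, 42)
println(c(3))   // compiles to c.apply(3); prints 42

So array(index) = 1 is just shorthand for array.update(index, 1), exactly as
above.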



On Sat, Sep 6, 2014 at 2:09 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I have an array of bytes and I have filled the array with 0 in all the
 positions.


 *var Array = Array.fill[Byte](10)(0)*
 Now, if certain conditions are satisfied, I want to change some elements
 of the array to 1 instead of 0. If I run,



 *if (Array.apply(index)==0) Array.apply(index) = 1*
 it returns me an error.

 But if I assign *Array.apply(index) *to a variable and do the same thing
 then it works. I do not want to assign this to variables because if I do
 this, I would be creating a lot of variables.

 Can anyone tell me a method to do this?

 Thank You





Re: error: type mismatch while Union

2014-09-06 Thread Aaron Davidson
Are you doing this from the spark-shell? You're probably running into
https://issues.apache.org/jira/browse/SPARK-1199 which should be fixed in
1.1.


On Sat, Sep 6, 2014 at 3:03 AM, Dhimant dhimant84.jays...@gmail.com wrote:

 I am using Spark version 1.0.2




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13618.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: question on replicate() in blockManager.scala

2014-09-06 Thread Aaron Davidson
Looks like that's BlockManagerWorker.syncPutBlock(), which is in an if
check, perhaps obscuring its existence.


On Fri, Sep 5, 2014 at 2:19 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,

 var cachedPeers: Seq[BlockManagerId] = null
 private def replicate(blockId: String, data: ByteBuffer, level: StorageLevel) {
   val tLevel = StorageLevel(level.useDisk, level.useMemory, level.deserialized, 1)
   if (cachedPeers == null) {
     cachedPeers = master.getPeers(blockManagerId, level.replication - 1)
   }
   for (peer: BlockManagerId <- cachedPeers) {
     val start = System.nanoTime
     data.rewind()
     logDebug("Try to replicate BlockId " + blockId + " once; The size of the data is "
       + data.limit() + " Bytes. To node: " + peer)
     if (!BlockManagerWorker.syncPutBlock(PutBlock(blockId, data, tLevel),
         new ConnectionManagerId(peer.host, peer.port))) {
       logError("Failed to call syncPutBlock to " + peer)
     }
     logDebug("Replicated BlockId " + blockId + " once used " +
       (System.nanoTime - start) / 1e6 + " s; The size of the data is " +
       data.limit() + " bytes.")
   }
 }


 I get the flow of this code. But I don't find any method being called for
 actually writing the data into the set of peers chosen for replication.

 Where exactly is the replication happening?

 Thank you!!
 -Karthik



Re: Task not serializable

2014-09-06 Thread Sean Owen
I disagree that the generally right change is to try to make the
classes serializable. Usually, classes that are not serializable are
not supposed to be serialized. You're using them in a way that's
causing them to be serialized, and that's probably not desired.

For example, this is wrong:

val foo: SomeUnserializableManagerClass = ...
rdd.map(d => foo.bar(d))

This is right:

rdd.map { d =>
  val foo: SomeUnserializableManagerClass = ...
  foo.bar(d)
}

In the first instance, you create the object on the driver and try to
serialize and copy it to workers. In the second, you're creating
SomeUnserializableManagerClass in the function and therefore on the
worker.

mapPartitions is better if this creation is expensive.
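
A minimal sketch of that mapPartitions variant (SomeUnserializableManagerClass
is the same placeholder as above, and its constructor is elided):

rdd.mapPartitions { iter =>
  // Created once per partition, on the worker, so the instance itself is
  // never serialized or shipped from the driver.
  val foo: SomeUnserializableManagerClass = ...
  iter.map(d => foo.bar(d))
}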

On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
 Hi,

 I'm trying to migrate a map-reduce program to work with Spark. I migrated
 the program from Java to Scala. The map-reduce program basically loads an
 HDFS file and, for each line in the file, applies several transformation
 functions available in various external libraries.

 When I execute this over Spark, it throws "Task not serializable"
 exceptions for each and every class being used from these external
 libraries. I added serialization to a few classes which are in my scope,
 but there are several other classes which are out of my scope, like
 org.apache.hadoop.io.Text.

 How to overcome these exceptions?

 ~Sarath.




Re: Getting the type of an RDD in spark AND pyspark

2014-09-06 Thread Aaron Davidson
Pretty easy to do in Scala:

rdd.elementClassTag.runtimeClass

You can access this method from Python as well by using the internal _jrdd.
It would look something like this (warning, I have not tested it):
rdd._jrdd.classTag().runtimeClass()

(The method name is classTag for JavaRDDLike, and elementClassTag for
Scala's RDD.)
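
For example, a quick spark-shell sketch on the Scala side (using the RDDs from
the question below; the commented output is what I'd expect):

val t = sc.parallelize(Array(1, 2, 3))
t.elementClassTag.runtimeClass                  // int

val u = sc.parallelize(Array(1, Array(2,2,2,2,2), 3))
u.elementClassTag.runtimeClass                  // class java.lang.Object, since the element type is Any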


On Thu, Sep 4, 2014 at 1:32 PM, esamanas evan.sama...@gmail.com wrote:

 Hi,

 I'm new to spark and scala, so apologies if this is obvious.

 Every RDD appears to be typed, which I can see by seeing the output in the
 spark-shell when I execute 'take':

 scala> val t = sc.parallelize(Array(1,2,3))
 t: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize
 at <console>:12

 scala> t.take(3)
 res4: Array[Int] = Array(1, 2, 3)


 scala> val u = sc.parallelize(Array(1,Array(2,2,2,2,2),3))
 u: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[3] at parallelize
 at <console>:12

 scala> u.take(3)
 res5: Array[Any] = Array(1, Array(2, 2, 2, 2, 2), 3)

 The Array type stays the same even if only one type is returned.
 scala> u.take(1)
 res6: Array[Any] = Array(1)


 Is there some way to just get the name of the type of the entire RDD from
 some function call?  Also, I would really like this same functionality in
 pyspark, so I'm wondering if that exists on that side, since clearly the
 underlying RDD is typed (I'd be fine with either the Scala or Python type
 name).

 Thank you,

 Evan




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-type-of-an-RDD-in-spark-AND-pyspark-tp13498.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





unsubscribe

2014-09-06 Thread Murali Raju



Re: unsubscribe

2014-09-06 Thread Derek Schoettle

Unsubscribe

 On Sep 6, 2014, at 7:48 AM, Murali Raju murali.r...@infrastacks.com
wrote:



Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
Hi Kui,

DDF (open sourced) also aims to do something similar, adding RDBMS idioms,
and is already implemented on top of Spark.

One philosophy is that the DDF API aggressively hides the notion of
parallel datasets, exposing only (mutable) tables to users, on which they
can apply R and other familiar data mining/machine learning idioms, without
having to know about the distributed representation underneath. Now, you
can get to the underlying RDDs if you want to, simply by asking for it.

This was launched at the July Spark Summit. See
http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
.

Sent while mobile. Please excuse typos etc.
On Sep 4, 2014 1:59 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu
wrote:

 Thanks Kui. SparkR is a pretty young project, but there are a bunch of
 things we are working on. One of the main features is to expose a data
 frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
 be integrating this with Spark's MLLib.  At a high-level this will
 allow R users to use a familiar API but make use of MLLib's efficient
 distributed implementation. This is the same strategy used in Python
 as well.

 Also we do hope to merge SparkR with mainline Spark -- we have a few
 features to complete before that and plan to shoot for integration by
 Spark 1.3.

 Thanks
 Shivaram

 On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote:
  Thanks, Shivaram.
 
  No specific use case yet. We are trying to use R in our project because our
  data scientists all know R. We had a concern about how R handles massive
  data. Spark does a better job in the big data area, and Spark ML is focused
  on predictive analytics. So we are considering whether we can merge R and
  Spark together. We tried SparkR and it is pretty easy to use, but we didn’t
  see any feedback on this package from industry. It would be better if the
  Spark team had R support just like Scala/Java/Python.
 
  Another question: if MLlib re-implements all the well-known data mining
  algorithms in Spark, then what is the purpose of using R?
 
  Another option for us is H2O, which supports R natively and is more
  friendly to data scientists. I saw H2O can also work on Spark (Sparkling
  Water). Is it better than using SparkR?
 
  Thanks and Regards.
 
  Kui
 
 
  On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
 
  Hi
 
  Do you have a specific use-case where SparkR doesn't work well ? We'd
 love
  to hear more about use-cases and features that can be improved with
 SparkR.
 
  Thanks
  Shivaram
 
 
  On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote:
 
  Does the Spark ML team have a plan to support R scripts natively? There is a
  SparkR project, but it is not from the Spark team. Spark ML uses netlib-java
  to talk to native Fortran routines, or uses NumPy, so why not try to use R
  in some sense?
 
  R has a lot of useful packages. If the Spark ML team can include R support,
  it will be very powerful.
 
  Any comment?
 
 
 
 
 





Re: Task not serializable

2014-09-06 Thread Sarath Chandra
Thanks Alok, Sean.

As suggested by Sean, I tried a sample program. I wrote a function in which
I made a reference to a class from a third-party library that is not
serializable and passed it to my map function. On executing it I got the same
exception.

Then I modified the program, removed the function, and wrote its contents as
an anonymous function inside the map function. This time the execution
succeeded.

I understood Sean's explanation, but I would appreciate references to a more
detailed explanation and examples of writing efficient Spark programs that
avoid such pitfalls.

~Sarath
 On 06-Sep-2014 4:32 pm, Sean Owen so...@cloudera.com wrote:

 I disagree that the generally right change is to try to make the
 classes serializable. Usually, classes that are not serializable are
 not supposed to be serialized. You're using them in a way that's
 causing them to be serialized, and that's probably not desired.

 For example, this is wrong:

 val foo: SomeUnserializableManagerClass = ...
 rdd.map(d => foo.bar(d))

 This is right:

 rdd.map { d =>
   val foo: SomeUnserializableManagerClass = ...
   foo.bar(d)
 }

 In the first instance, you create the object on the driver and try to
 serialize and copy it to workers. In the second, you're creating
 SomeUnserializableManagerClass in the function and therefore on the
 worker.

 mapPartitions is better if this creation is expensive.

 On Fri, Sep 5, 2014 at 3:06 PM, Sarath Chandra
 sarathchandra.jos...@algofusiontech.com wrote:
  Hi,
 
  I'm trying to migrate a map-reduce program to work with Spark. I migrated
  the program from Java to Scala. The map-reduce program basically loads an
  HDFS file and, for each line in the file, applies several transformation
  functions available in various external libraries.
 
  When I execute this over Spark, it throws "Task not serializable"
  exceptions for each and every class being used from these external
  libraries. I added serialization to a few classes which are in my scope,
  but there are several other classes which are out of my scope, like
  org.apache.hadoop.io.Text.
 
  How to overcome these exceptions?
 
  ~Sarath.



Re: How spark parallelize maps Slices to tasks/executors/workers

2014-09-06 Thread Matthew Farrellee

On 09/04/2014 09:55 PM, Mozumder, Monir wrote:

I have this 2-node cluster setup, where each node has 4-cores.

 MASTER

 (Worker-on-master)  (Worker-on-node1)

(slaves(master,node1))

SPARK_WORKER_INSTANCES=1

I am trying to understand Spark's parallelize behavior. The sparkPi
example has this code:

 val slices = 8
 val n = 10 * slices
 val count = spark.parallelize(1 to n, slices).map { i =>
   val x = random * 2 - 1
   val y = random * 2 - 1
   if (x*x + y*y < 1) 1 else 0
 }.reduce(_ + _)

As per the documentation: "Spark will run one task for each slice of the
cluster. Typically you want 2-4 slices for each CPU in your cluster." I
set slices to 8, which means the working set will be divided among 8
tasks on the cluster; in turn each worker node gets 4 tasks (1:1 per core).

Questions:

i)  Where can I see task-level details? Inside executors I don't see a
task breakdown, so I cannot see the effect of slices in the UI.


under http://localhost:4040/stages/ you can drill into individual stages 
to see task details




ii) How do I programmatically find the working set size for the map
function above? I assume it is n/slices (10 above).


it'll be roughly n/slices. you can mapPartitions() and check their length
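
for example, something like this rough sketch (reusing the variable names
from the SparkPi snippet above):

val partitionSizes = spark.parallelize(1 to n, slices)
  .mapPartitions(iter => Iterator(iter.size))   // one size per partition
  .collect()
println(partitionSizes.mkString(", "))          // e.g. 10, 10, ..., 10 for n = 80, slices = 8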



iii) Are the multiple tasks run by an executor executed sequentially or
in parallel in multiple threads?


parallel. have a look at 
https://spark.apache.org/docs/latest/cluster-overview.html




iv) Reasoning behind 2-4 slices per CPU.


typically things like 2-4 slices per CPU are general rules of thumb 
because tasks are more io bound than not. depending on your workload 
this might change. it's probably one of the last things you'll want to 
optimize, first being the transformation ordering in your dag.




v) I assume ideally we should tune SPARK_WORKER_INSTANCES to
correspond to number of

Bests,

-Monir



best,


matt




Spark SQL check if query is completed (pyspark)

2014-09-06 Thread jamborta
Hi,

I am using Spark SQL to run some administrative queries and joins (e.g.
create table, insert overwrite, etc), where the query does not return any
data. I noticed if the query fails it prints some error message on the
console, but does not actually throw an exception (this is spark 1.0.2). 

Is there any way to get these errors from the returned object?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-check-if-query-is-completed-pyspark-tp13630.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: unsubscribe

2014-09-06 Thread Nicholas Chammas
To unsubscribe send an email to user-unsubscr...@spark.apache.org

Links to sub/unsub are here: https://spark.apache.org/community.html


On Sat, Sep 6, 2014 at 7:52 AM, Derek Schoettle dscho...@us.ibm.com wrote:

 Unsubscribe

  On Sep 6, 2014, at 7:48 AM, Murali Raju murali.r...@infrastacks.com
 wrote:
 
 



Re: Support R in Spark

2014-09-06 Thread oppokui
Thanks, Christopher. I saw it before; it is amazing. Last time I tried to
download it from Adatao, but there was no response after filling in the form.
How can I download it or its source code? What is the license?

Kui


 On Sep 6, 2014, at 8:08 PM, Christopher Nguyen c...@adatao.com wrote:
 
 Hi Kui,
 
 DDF (open sourced) also aims to do something similar, adding RDBMS idioms, 
 and is already implemented on top of Spark.
 
 One philosophy is that the DDF API aggressively hides the notion of parallel 
 datasets, exposing only (mutable) tables to users, on which they can apply R 
 and other familiar data mining/machine learning idioms, without having to 
 know about the distributed representation underneath. Now, you can get to the 
 underlying RDDs if you want to, simply by asking for it.
 
 This was launched at the July Spark Summit. See 
 http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
  .
 
 Sent while mobile. Please excuse typos etc.
 
 On Sep 4, 2014 1:59 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu 
 wrote:
 Thanks Kui. SparkR is a pretty young project, but there are a bunch of
 things we are working on. One of the main features is to expose a data
 frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
 be integrating this with Spark's MLLib.  At a high-level this will
 allow R users to use a familiar API but make use of MLLib's efficient
 distributed implementation. This is the same strategy used in Python
 as well.
 
 Also we do hope to merge SparkR with mainline Spark -- we have a few
 features to complete before that and plan to shoot for integration by
 Spark 1.3.
 
 Thanks
 Shivaram
 
 On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote:
  Thanks, Shivaram.
 
  No specific use case yet. We are trying to use R in our project because our
  data scientists all know R. We had a concern about how R handles massive
  data. Spark does a better job in the big data area, and Spark ML is focused
  on predictive analytics. So we are considering whether we can merge R and
  Spark together. We tried SparkR and it is pretty easy to use, but we didn’t
  see any feedback on this package from industry. It would be better if the
  Spark team had R support just like Scala/Java/Python.
 
  Another question: if MLlib re-implements all the well-known data mining
  algorithms in Spark, then what is the purpose of using R?
 
  Another option for us is H2O, which supports R natively and is more
  friendly to data scientists. I saw H2O can also work on Spark (Sparkling
  Water). Is it better than using SparkR?
 
  Thanks and Regards.
 
  Kui
 
 
  On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
 
  Hi
 
  Do you have a specific use-case where SparkR doesn't work well ? We'd love
  to hear more about use-cases and features that can be improved with SparkR.
 
  Thanks
  Shivaram
 
 
  On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote:
 
  Does the Spark ML team have a plan to support R scripts natively? There is a
  SparkR project, but it is not from the Spark team. Spark ML uses netlib-java
  to talk to native Fortran routines, or uses NumPy, so why not try to use R
  in some sense?
 
  R has a lot of useful packages. If the Spark ML team can include R support,
  it will be very powerful.
 
  Any comment?
 
 
 
 
 
 
 



Re: Spark SQL check if query is completed (pyspark)

2014-09-06 Thread Davies Liu
SQLContext.sql() returns a SchemaRDD; you need to call collect()
to pull the data in.
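
A rough Scala sketch of the same idea (the thread is about PySpark, but the
pattern is the same; sqlContext is assumed to be a HiveContext and the table
names are made up): force execution with collect() and catch the failure
explicitly.

import scala.util.{Failure, Success, Try}

Try(sqlContext.sql("INSERT OVERWRITE TABLE target SELECT * FROM source").collect()) match {
  case Success(_)  => println("query completed")
  case Failure(ex) => println("query failed: " + ex.getMessage)
}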

On Sat, Sep 6, 2014 at 6:02 AM, jamborta jambo...@gmail.com wrote:
 Hi,

 I am using Spark SQL to run some administrative queries and joins (e.g.
 create table, insert overwrite, etc), where the query does not return any
 data. I noticed if the query fails it prints some error message on the
 console, but does not actually throw an exception (this is spark 1.0.2).

 Is there any way to get these errors from the returned object?

 thanks,



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-check-if-query-is-completed-pyspark-tp13630.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Getting the type of an RDD in spark AND pyspark

2014-09-06 Thread Davies Liu
But you cannot get what you expect in PySpark, because the underlying
Scala RDD holds serialized data, so it will always be RDD[Array[Byte]],
whatever the type of the RDD in Python is.

Davies

On Sat, Sep 6, 2014 at 4:09 AM, Aaron Davidson ilike...@gmail.com wrote:
 Pretty easy to do in Scala:

 rdd.elementClassTag.runtimeClass

 You can access this method from Python as well by using the internal _jrdd.
 It would look something like this (warning, I have not tested it):
 rdd._jrdd.classTag().runtimeClass()

 (The method name is classTag for JavaRDDLike, and elementClassTag for
 Scala's RDD.)


 On Thu, Sep 4, 2014 at 1:32 PM, esamanas evan.sama...@gmail.com wrote:

 Hi,

 I'm new to spark and scala, so apologies if this is obvious.

 Every RDD appears to be typed, which I can see by seeing the output in the
 spark-shell when I execute 'take':

 scala> val t = sc.parallelize(Array(1,2,3))
 t: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize
 at <console>:12

 scala> t.take(3)
 res4: Array[Int] = Array(1, 2, 3)


 scala> val u = sc.parallelize(Array(1,Array(2,2,2,2,2),3))
 u: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[3] at parallelize
 at <console>:12

 scala> u.take(3)
 res5: Array[Any] = Array(1, Array(2, 2, 2, 2, 2), 3)

 The Array type stays the same even if only one type is returned.
 scala> u.take(1)
 res6: Array[Any] = Array(1)


 Is there some way to just get the name of the type of the entire RDD from
 some function call?  Also, I would really like this same functionality in
 pyspark, so I'm wondering if that exists on that side, since clearly the
 underlying RDD is typed (I'd be fine with either the Scala or Python type
 name).

 Thank you,

 Evan




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-type-of-an-RDD-in-spark-AND-pyspark-tp13498.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Q: About scenarios where driver execution flow may block...

2014-09-06 Thread didata

Hello friends:


I have a theory question about call blocking in a Spark driver.


Consider this (admittedly contrived =:)) snippet to illustrate this question...



x = rdd01.reduceByKey()  # or maybe some other 'shuffle-requiring action'.


b = sc.broadcast(x.take(20)) # Or any statement that requires the previous
statement to complete, cluster-wide.



y = rdd02.someAction(f(b))



Would the first or second statement above block because the second (or 
third) statement needs to wait for the previous one to complete, cluster-wide?



Maybe this isn't the best example (typed on a phone), but generally I'm
trying to understand the scenario(s) where an RDD call in the driver may
block because the graph indicates that the next statement is dependent on
the completion of the current one, cluster-wide (not just lazily evaluated).
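
To make the scenario concrete, here is a rough Scala analogue of the snippet
above (rdd01, rdd02 and f are hypothetical); the comments mark the usual
lazy-transformation vs. blocking-action distinction:

val x = rdd01.reduceByKey(_ + _)       // transformation: lazy, returns immediately
val b = sc.broadcast(x.take(20))       // take() is an action: the driver blocks here
                                       // until that job completes cluster-wide
val y = rdd02.map(d => f(b.value, d))  // transformation again: lazy, no blocking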


Thank you. :)


Sincerely yours,
Team Dimension Data


Re: Is there any way to control the parallelism in LogisticRegression

2014-09-06 Thread DB Tsai
Yes. But you need to store the RDD as *serialized* Java objects. See the
section on storage levels:
http://spark.apache.org/docs/latest/programming-guide.html
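
A minimal sketch of what that looks like (trainingData and the input path are
stand-ins; spark.rdd.compress additionally compresses the serialized
partitions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().set("spark.rdd.compress", "true"))

// Cache as serialized Java objects rather than deserialized ones.
val trainingData = sc.textFile("hdfs:///path/to/data")   // hypothetical path
trainingData.persist(StorageLevel.MEMORY_ONLY_SER)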

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Thu, Sep 4, 2014 at 8:06 PM, Jiusheng Chen chenjiush...@gmail.com
wrote:

 Thanks DB.
 Did you mean this?

 spark.rdd.compress  true


 On Thu, Sep 4, 2014 at 2:48 PM, DB Tsai dbt...@dbtsai.com wrote:

 To save memory, I recommend you compress the cached RDD; it
 will be a couple of times smaller than the original data set.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Wed, Sep 3, 2014 at 9:28 PM, Jiusheng Chen chenjiush...@gmail.com
 wrote:

 Thanks DB and Xiangrui. Glad to know you guys are actively working on it.

 Another thing: did we evaluate the loss of using Float to store values?
 Currently it is Double. Using fewer bits has the benefit of memory footprint
 reduction. According to Google, they even use 16 bits (a special
 encoding scheme called q2.13,
 http://jmlr.org/proceedings/papers/v28/golovin13.pdf) in their learner
 without measurable loss, and can save 75% of the memory.


 On Thu, Sep 4, 2014 at 11:02 AM, DB Tsai dbt...@dbtsai.com wrote:

 With David's help today, we were able to implement elastic net GLM in
 Spark. It's surprisingly easy, and with just some modification to breeze's
 OWLQN code, it just works without further investigation.

 We did a benchmark, and the coefficients are within 0.5% difference
 compared with R's glmnet package. I guess this is the first truly distributed
 glmnet implementation.

 It still requires some effort to get it into MLlib; mostly API cleanup work.

 1) I'll submit a PR to breeze which implements weighted regularization
 in OWLQN.
 2) This also depends on
 https://issues.apache.org/jira/browse/SPARK-2505 which we have
 internal version requiring some cleanup for open source project.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Wed, Sep 3, 2014 at 7:34 PM, Xiangrui Meng men...@gmail.com wrote:

 +DB  David (They implemented QWLQN on Spark today.)
 On Sep 3, 2014 7:18 PM, Jiusheng Chen chenjiush...@gmail.com
 wrote:

 Hi Xiangrui,

 A side-by question about MLLib.
 It looks like the current LBFGS in MLlib (version 1.0.2 and even v1.1) only
 supports L2 regularization; the doc explains it: "The L1 regularization by
 using L1Updater
 (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.optimization.L1Updater)
 will not work since the soft-thresholding logic in L1Updater is designed
 for gradient descent."

 Since the algorithm comes from Breeze, and I noticed Breeze actually
 supports L1 (OWLQN,
 http://www.scalanlp.org/api/breeze/#breeze.optimize.OWLQN), I am wondering
 whether there are special considerations why current MLlib doesn't support
 OWLQN? And is there any plan to add it?

 Thanks for your time!



 On Fri, Aug 22, 2014 at 12:56 PM, ZHENG, Xu-dong dong...@gmail.com
 wrote:

 Update.

 I just found a magic parameter *balanceSlack* in *CoalescedRDD*,
 which sounds like it could control the locality. The default value is 0.1
 (a smaller value means lower locality). I changed it to 1.0 (full locality)
 and used the #3 approach, then found a lot of improvement (20%~40%). Although
 the Web UI still shows the task type as 'ANY' and the input comes from
 shuffle read, the real performance is much better than before changing this
 parameter.
 [image: Inline image 1]

 I think the benefits include:

 1. This approach keeps the physical partition size small, but makes
 each task handle multiple partitions. So the memory requested for
 deserialization is reduced, which also reduces the GC time. That is exactly
 what we observed in our job.

 2. This approach will not hit the 2G limitation, because it does not
 change the partition size.

 And I also think that Spark may want to change this default value, or at
 least expose this parameter to users (*CoalescedRDD* is a private
 class, and *RDD*.*coalesce* also doesn't have a parameter to control
 that).

 On Wed, Aug 13, 2014 at 12:28 AM, Xiangrui Meng men...@gmail.com
 wrote:

 Sorry, I missed #2. My suggestion is the same as #2. You need to set a
 bigger numPartitions to avoid hitting the integer bound or the 2G limitation,
 at the cost of increased shuffle size per iteration. If you use a
 CombineInputFormat and then cache, it will try to give you roughly the
 same size per partition. There will be some remote fetches from HDFS,
 but it is still cheaper than calling RDD.repartition().

 For coalesce without shuffle, I don't know how to set the right number
 of partitions either ...

 -Xiangrui

 On Tue, Aug 12, 2014 at 6:16 AM, ZHENG, Xu-dong dong...@gmail.com
 wrote:
  Hi Xiangrui,
 
  Thanks for your reply!
 
  Yes, 

Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
Hi Kui, sorry about that. That link you mentioned is probably the one for
the products. We don't have one pointing from adatao.com to ddf.io; maybe
we'll add it.

As for access to the code base itself, I think the team has already created
a GitHub repo for it, and should open it up within a few weeks. There's
some debate about whether to put out the implementation with Shark
dependencies now, or the SparkSQL-based one, which has somewhat limited
functionality and is not as well tested.

I'll check and ping when this is opened up.

The license is Apache.

Sent while mobile. Please excuse typos etc.
On Sep 6, 2014 1:39 PM, oppokui oppo...@gmail.com wrote:

 Thanks, Christopher. I saw it before; it is amazing. Last time I tried to
 download it from Adatao, but there was no response after filling in the form.
 How can I download it or its source code? What is the license?

 Kui


 On Sep 6, 2014, at 8:08 PM, Christopher Nguyen c...@adatao.com wrote:

 Hi Kui,

 DDF (open sourced) also aims to do something similar, adding RDBMS idioms,
 and is already implemented on top of Spark.

 One philosophy is that the DDF API aggressively hides the notion of
 parallel datasets, exposing only (mutable) tables to users, on which they
 can apply R and other familiar data mining/machine learning idioms, without
 having to know about the distributed representation underneath. Now, you
 can get to the underlying RDDs if you want to, simply by asking for it.

 This was launched at the July Spark Summit. See
 http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
 .

 Sent while mobile. Please excuse typos etc.
 On Sep 4, 2014 1:59 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Thanks Kui. SparkR is a pretty young project, but there are a bunch of
 things we are working on. One of the main features is to expose a data
 frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
 be integrating this with Spark's MLLib.  At a high-level this will
 allow R users to use a familiar API but make use of MLLib's efficient
 distributed implementation. This is the same strategy used in Python
 as well.

 Also we do hope to merge SparkR with mainline Spark -- we have a few
 features to complete before that and plan to shoot for integration by
 Spark 1.3.

 Thanks
 Shivaram

 On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote:
  Thanks, Shivaram.
 
  No specific use case yet. We are trying to use R in our project because our
  data scientists all know R. We had a concern about how R handles massive
  data. Spark does a better job in the big data area, and Spark ML is focused
  on predictive analytics. So we are considering whether we can merge R and
  Spark together. We tried SparkR and it is pretty easy to use, but we didn’t
  see any feedback on this package from industry. It would be better if the
  Spark team had R support just like Scala/Java/Python.
 
  Another question: if MLlib re-implements all the well-known data mining
  algorithms in Spark, then what is the purpose of using R?
 
  Another option for us is H2O, which supports R natively and is more
  friendly to data scientists. I saw H2O can also work on Spark (Sparkling
  Water). Is it better than using SparkR?
 
  Thanks and Regards.
 
  Kui
 
 
  On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
 
  Hi
 
  Do you have a specific use-case where SparkR doesn't work well ? We'd
 love
  to hear more about use-cases and features that can be improved with
 SparkR.
 
  Thanks
  Shivaram
 
 
  On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote:
 
  Does the Spark ML team have a plan to support R scripts natively? There is a
  SparkR project, but it is not from the Spark team. Spark ML uses netlib-java
  to talk to native Fortran routines, or uses NumPy, so why not try to use R
  in some sense?
 
  R has a lot of useful packages. If the Spark ML team can include R support,
  it will be very powerful.
 
  Any comment?
 
 
 
 
 






Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-06 Thread Victor Tso-Guillen
I ran into the same issue. What I did was use the Maven Shade plugin to shade
my version of the httpcomponents libraries into another package.
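
For reference, the same relocation idea expressed with sbt-assembly shade
rules looks roughly like this (assuming an sbt-assembly version that supports
shade rules; the target package name is made up):

// build.sbt sketch
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.http.**" -> "myshaded.org.apache.http.@1").inAll
)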


On Fri, Sep 5, 2014 at 4:33 PM, Penny Espinoza 
pesp...@societyconsulting.com wrote:

  Hey - I’m struggling with some dependency issues with
 org.apache.httpcomponents httpcore and httpclient when using spark-submit
 with YARN running Spark 1.0.2 on a Hadoop 2.2 cluster.  I’ve seen several
 posts about this issue, but no resolution.

  The error message is this:


  Caused by: java.lang.NoSuchMethodError:
 org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
  at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
  at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
  at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
  at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:85)
  at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:93)
  at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
  at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96)
  at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:155)
  at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:118)
  at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:102)
  at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:332)
  at com.oncue.rna.realtime.streaming.config.package$.transferManager(package.scala:76)
  at com.oncue.rna.realtime.streaming.models.S3SchemaRegistry.<init>(SchemaRegistry.scala:27)
  at com.oncue.rna.realtime.streaming.models.S3SchemaRegistry$.schemaRegistry$lzycompute(SchemaRegistry.scala:46)
  at com.oncue.rna.realtime.streaming.models.S3SchemaRegistry$.schemaRegistry(SchemaRegistry.scala:44)
  at com.oncue.rna.realtime.streaming.coders.KafkaAvroDecoder.<init>(KafkaAvroDecoder.scala:20)
  ... 17 more

  The Apache HttpComponents libraries include the method above as of
 version 4.2. The Spark 1.0.2 binaries seem to include version 4.1.

  I can get this to work in my driver program by adding exclusions to
 force use of 4.1, but then I get the error in tasks even when using the
 --jars option of the spark-submit command. How can I get both the driver
 program and the individual tasks in my spark-streaming job to use the same
 version of this library so my job will run all the way through?

  thanks
 p