Re: can spark take advantage of ordered data?

2017-03-10 Thread sourabh chaki
My use case is also quite similar. I have 2 feeds, one 3TB and another
100GB. Both feeds are generated by a Hadoop reduce operation and
partitioned by the Hadoop HashPartitioner. The 3TB feed has 10K partitions
whereas the 100GB feed has 200 partitions.

Now when I do a join between these two feeds using Spark, Spark shuffles
both RDDs and it takes a long time to complete. Can we do something so
that Spark recognises the existing partitions of the 3TB feed and shuffles
only the smaller 100GB feed? Could it be a map-side scan for the bigger RDD
and a shuffle read from the smaller RDD?
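
For reference, a minimal sketch of what stock Spark offers here, assuming the
feeds are loaded as keyed RDDs; the function names, paths and parsing below are
placeholders. Spark cannot see the Hadoop hash partitioning, so the 3TB feed is
still shuffled once by partitionBy, but once that result is cached a join
adopts its partitioner and only shuffles the smaller feed:

import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD

// Placeholder loader: assumes tab-separated "key<TAB>value" lines.
def keyedFeed(sc: SparkContext, path: String): RDD[(String, String)] =
  sc.textFile(path).map { line =>
    val Array(k, v) = line.split("\t", 2)
    (k, v)
  }

def joinFeeds(sc: SparkContext): RDD[(String, (String, String))] = {
  val big   = keyedFeed(sc, "hdfs:///feeds/feed-3tb")    // placeholder path
  val small = keyedFeed(sc, "hdfs:///feeds/feed-100gb")  // placeholder path

  // Spark does not know the files are already hash-partitioned, so this
  // shuffles the big feed once; persisting it lets later joins reuse the work.
  val bigPartitioned = big.partitionBy(new HashPartitioner(10000)).persist()

  // join adopts bigPartitioned's partitioner, so only the smaller feed is
  // shuffled here (and in any subsequent join against bigPartitioned).
  bigPartitioned.join(small)
}

This does not avoid that first shuffle of the 3TB feed, so it only pays off
when the big feed is reused across several joins.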

I have looked at the spark-sorted project, but it does not utilise the
pre-existing partitions in the feed.
Any pointer will be helpful.

Thanks
Sourabh

On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid  wrote:

> Hi Jonathan,
>
> you might be interested in https://issues.apache.org/jira/browse/SPARK-3655
> (not yet available) and https://github.com/tresata/spark-sorted (not part of
> Spark, but it is available right now). Hopefully that's what you are looking
> for. To the best of my knowledge that covers what is available now / what is
> being worked on.
>
> Imran
>
> On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney 
> wrote:
>
>> Hello all,
>>
>> I am wondering if spark already has support for optimizations on sorted
>> data and/or if such support could be added (I am comfortable dropping to a
>> lower level if necessary to implement this, but I'm not sure if it is
>> possible at all).
>>
>> Context: we have a number of data sets which are essentially already
>> sorted on a key. With our current systems, we can take advantage of this to
>> do a lot of analysis in a very efficient fashion...merges and joins, for
>> example, can be done very efficiently, as can folds on a secondary key and
>> so on.
>>
>> I was wondering if spark would be a fit for implementing these sorts of
>> optimizations? Obviously it is sort of a niche case, but would this be
>> achievable? Any pointers on where I should look?
>>
>
>
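
As an illustration of the kind of optimisation Jonathan describes, here is a
minimal sketch of a merge join via zipPartitions, assuming the two RDDs are
already co-partitioned and sorted by key within each partition; the method name
and the unique-keys simplification are illustrative, not something Spark ships:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD

// Merge join of two co-partitioned RDDs whose partitions are sorted by key.
// For brevity it buffers each output partition and assumes each key appears
// at most once per side; duplicate keys would need per-key buffering.
def sortedMergeJoin[K: Ordering, V, W](left: RDD[(K, V)],
                                       right: RDD[(K, W)]): RDD[(K, (V, W))] =
  left.zipPartitions(right, preservesPartitioning = true) { (lIter, rIter) =>
    val ord = implicitly[Ordering[K]]
    val l = lIter.buffered
    val r = rIter.buffered
    val out = ArrayBuffer.empty[(K, (V, W))]
    while (l.hasNext && r.hasNext) {
      val c = ord.compare(l.head._1, r.head._1)
      if (c < 0) l.next()
      else if (c > 0) r.next()
      else {
        val (k, v) = l.next()
        out += ((k, (v, r.next()._2)))
      }
    }
    out.iterator
  }

Nothing is shuffled or hashed here; each task just streams through two sorted
iterators. The hard part is arranging (and telling Spark) that the inputs are
co-partitioned and sorted in the first place, which is the kind of thing the
spark-sorted project and SPARK-3655 mentioned above are concerned with.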


Re: JAVA_HOME problem

2015-04-28 Thread sourabh chaki
I was able to solve this problem by hard-coding JAVA_HOME inside
org.apache.spark.deploy.yarn.Client.scala.




The line that builds the container command was originally

val commands = prefixEnv ++ Seq(
  YarnSparkHadoopUtil.expandEnvironment(Environment.JAVA_HOME) + "/bin/java", "-server")

and I replaced it with the hard-coded path:

val commands = prefixEnv ++ Seq("/usr/java/jdk1.7.0_51/bin/java", "-server")

Somehow {{JAVA_HOME}} was not getting resolved on the YARN container node.
This change fixed that problem, but now I am getting a new
error.

Container: container_1430123808466_36297_02_01
===
LogType: stderr
LogLength: 87
Log Contents:
Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher

LogType: stdout
LogLength: 0
Log Contents:

Looks like classpath variables are now not resolved on the YARN node. I
have MapReduce jobs running in the same cluster working without any
problem. Any pointer on why this could happen?


Thanks

Sourabh


On Fri, Apr 24, 2015 at 3:52 PM, sourabh chaki chaki.sour...@gmail.com
wrote:

 Yes Akhil. This is the same issue. I have updated my comment in that
 ticket.

 Thanks
 Sourabh

 On Fri, Apr 24, 2015 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Isn't this related to this
 https://issues.apache.org/jira/browse/SPARK-6681

 Thanks
 Best Regards

 On Fri, Apr 24, 2015 at 11:40 AM, sourabh chaki chaki.sour...@gmail.com
 wrote:

 I am also facing the same problem with Spark 1.3.0 in both yarn-client and
 yarn-cluster mode. Launching the YARN container fails, and this is the error
 in stderr:

 Container: container_1429709079342_65869_01_01

 ===
 LogType: stderr
 LogLength: 61
 Log Contents:
 /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

 LogType: stdout
 LogLength: 0
 Log Contents:

 I have added JAVA_HOME in hadoop-env.sh as well as in spark-env.sh:
 grep JAVA_HOME /etc/hadoop/conf.cloudera.yarn/hadoop-env.sh
 export JAVA_HOME=/usr/java/default
 export PATH=$PATH:$JAVA_HOME/bin/java

 grep JAVA_HOME /var/spark/spark-1.3.0-bin-hadoop2.4/conf/spark-env.sh
 export JAVA_HOME=/usr/java/default

 I could see another thread about the same problem, but I don't see any
 solution there:

 http://stackoverflow.com/questions/29170280/java-home-error-with-upgrade-to-spark-1-3-0
  Any pointer will be helpful.

 Thanks
 Sourabh


 On Thu, Apr 2, 2015 at 1:23 PM, 董帅阳 917361...@qq.com wrote:

 spark 1.3.0


 spark@pc-zjqdyyn1:~ tail /etc/profile
 export JAVA_HOME=/usr/jdk64/jdk1.7.0_45
 export PATH=$PATH:$JAVA_HOME/bin

 #
 # End of /etc/profile
 #


 But ERROR LOG

 Container: container_1427449644855_0092_02_01 on pc-zjqdyy04_45454
 
 LogType: stderr
 LogLength: 61
 Log Contents:
 /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

 LogType: stdout
 LogLength: 0
 Log Contents:







Re: JAVA_HOME problem

2015-04-24 Thread sourabh chaki
Yes Akhil. This is the same issue. I have updated my comment in that ticket.

Thanks
Sourabh

On Fri, Apr 24, 2015 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Isn't this related to this
 https://issues.apache.org/jira/browse/SPARK-6681

 Thanks
 Best Regards

 On Fri, Apr 24, 2015 at 11:40 AM, sourabh chaki chaki.sour...@gmail.com
 wrote:

 I am also facing the same problem with Spark 1.3.0 in both yarn-client and
 yarn-cluster mode. Launching the YARN container fails, and this is the error
 in stderr:

 Container: container_1429709079342_65869_01_01

 ===
 LogType: stderr
 LogLength: 61
 Log Contents:
 /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

 LogType: stdout
 LogLength: 0
 Log Contents:

 I have added JAVA_HOME in hadoop-env.sh as well as in spark-env.sh:
 grep JAVA_HOME /etc/hadoop/conf.cloudera.yarn/hadoop-env.sh
 export JAVA_HOME=/usr/java/default
 export PATH=$PATH:$JAVA_HOME/bin/java

 grep JAVA_HOME /var/spark/spark-1.3.0-bin-hadoop2.4/conf/spark-env.sh
 export JAVA_HOME=/usr/java/default

 I could see another thread about the same problem, but I don't see any
 solution there:

 http://stackoverflow.com/questions/29170280/java-home-error-with-upgrade-to-spark-1-3-0
  Any pointer will be helpful.

 Thanks
 Sourabh


 On Thu, Apr 2, 2015 at 1:23 PM, 董帅阳 917361...@qq.com wrote:

 spark 1.3.0


 spark@pc-zjqdyyn1:~ tail /etc/profile
 export JAVA_HOME=/usr/jdk64/jdk1.7.0_45
 export PATH=$PATH:$JAVA_HOME/bin

 #
 # End of /etc/profile
 #


 But ERROR LOG

 Container: container_1427449644855_0092_02_01 on pc-zjqdyy04_45454
 
 LogType: stderr
 LogLength: 61
 Log Contents:
 /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

 LogType: stdout
 LogLength: 0
 Log Contents:






Re: JAVA_HOME problem

2015-04-24 Thread sourabh chaki
I am also facing the same problem with Spark 1.3.0 in both yarn-client and
yarn-cluster mode. Launching the YARN container fails, and this is the error
in stderr:

Container: container_1429709079342_65869_01_01
===
LogType: stderr
LogLength: 61
Log Contents:
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

LogType: stdout
LogLength: 0
Log Contents:

I have added JAVA_HOME in hadoop-env.sh as well as in spark-env.sh:
grep JAVA_HOME /etc/hadoop/conf.cloudera.yarn/hadoop-env.sh
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin/java

grep JAVA_HOME /var/spark/spark-1.3.0-bin-hadoop2.4/conf/spark-env.sh
export JAVA_HOME=/usr/java/default

I could see another thread about the same problem, but I don't see any
solution there:
http://stackoverflow.com/questions/29170280/java-home-error-with-upgrade-to-spark-1-3-0
 Any pointer will be helpful.

Thanks
Sourabh


On Thu, Apr 2, 2015 at 1:23 PM, 董帅阳 917361...@qq.com wrote:

 spark 1.3.0


 spark@pc-zjqdyyn1:~ tail /etc/profile
 export JAVA_HOME=/usr/jdk64/jdk1.7.0_45
 export PATH=$PATH:$JAVA_HOME/bin

 #
 # End of /etc/profile
 #


 But ERROR LOG

 Container: container_1427449644855_0092_02_01 on pc-zjqdyy04_45454
 
 LogType: stderr
 LogLength: 61
 Log Contents:
 /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

 LogType: stdout
 LogLength: 0
 Log Contents:



Re: train many decision trees with a single spark job

2015-01-13 Thread sourabh chaki
Hi Josh,

I was trying out a decision tree ensemble using bagging. Here I am splitting
the input using randomSplit and training a tree for each split. Here is some
sample code:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

val bags: Int = 10
val models: Array[DecisionTreeModel] =
  training.randomSplit(Array.fill(bags)(1.0 / bags)).map { data =>
    // placeholder parameters: numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins
    DecisionTree.trainClassifier(toLabeledPoints(data), 2, Map[Int, Int](), "gini", 5, 32)
  }

def toLabeledPoints(data: RDD[Double]): RDD[LabeledPoint] = {
  // convert the data RDD to an RDD of LabeledPoint
  ???
}

For your case, I think you need custom logic to split the dataset.

Thanks
Sourabh


On Tue, Jan 13, 2015 at 3:55 PM, Sean Owen so...@cloudera.com wrote:

 OK, I still wonder whether it's not better to make one big model. The
 usual assumption is that the user's identity isn't predictive per se.
 If every customer in your shop is truly unlike the others, most
 predictive analytics goes out the window. It's factors like our
 location, income, etc that are predictive and there aren't a million
 of those.

 But let's say it's so and you really need 1M RDDs. I think I'd just
 repeatedly filter the source RDD. That really won't be the slow step.
 I think the right way to do it is to create a list of all user IDs on
 the driver, turn it into a parallel collection (and override the # of
 threads it uses on the driver to something reasonable) and map each
 one to the result of filtering and modeling that user subset.
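
 (A rough sketch of the approach just described, assuming the training data is
 an RDD of (userId, LabeledPoint) pairs; the pool size and tree parameters are
 placeholders, and the forkjoin import matches the Scala 2.10/2.11 builds Spark
 used at the time.)

 import scala.collection.parallel.ForkJoinTaskSupport
 import scala.concurrent.forkjoin.ForkJoinPool
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.DecisionTree
 import org.apache.spark.mllib.tree.model.DecisionTreeModel
 import org.apache.spark.rdd.RDD

 def trainPerUser(data: RDD[(String, LabeledPoint)]): Map[String, DecisionTreeModel] = {
   data.cache()
   val userIds = data.keys.distinct().collect()

   // Parallel collection on the driver, with a capped pool so we don't
   // submit thousands of concurrent Spark jobs at once.
   val par = userIds.par
   par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))

   par.map { id =>
     val subset = data.filter { case (user, _) => user == id }.values
     // args: numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins
     id -> DecisionTree.trainClassifier(subset, 2, Map[Int, Int](), "gini", 5, 32)
   }.seq.toMap
 }

 Caching the input up front matters here, since every per-user job rescans it
 otherwise.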

 The problem is just the overhead of scheduling millions and millions of tiny
 modeling jobs. It will still probably take a long time. It could be fine if
 you still have millions of data points per user; it's even appropriate. But
 then the challenge here is that you're processing trillions of data points!
 That will be fun.

 I think any distributed system is overkill and not designed for the
 case where data fits into memory. You can always take a local
 collection and call parallelize to make it into an RDD, so in that
 sense Spark can handle a tiny data set if you really want.

 I'm still not sure I've seen a case where you want to partition by
 user but trust you really need that.

 On Tue, Jan 13, 2015 at 1:30 AM, Josh Buffum jbuf...@gmail.com wrote:
  You are right... my code example doesn't work :)
 
  I actually do want a decision tree per user. So, for 1 million users, I want
  1 million trees. We're training against time series data, so there are still
  quite a few data points per user. My previous message where I mentioned RDDs
  with no length was, I think, a result of the way the random partitioning
  worked (I was partitioning into N groups where N was the total number of
  users).
 
  Given this, I'm thinking MLlib is not designed for this particular case? It
  appears optimized for training across large datasets. I was just hoping to
  leverage it since creating my feature sets for the users was already done in
  Spark.
 
 
  On Mon, Jan 12, 2015 at 5:05 PM, Sean Owen so...@cloudera.com wrote:
 
  A model partitioned by users?
 
  I mean that if you have a million users, surely you don't mean to build a
  million models. There would be little data per user, right? Sounds like you
  have 0 sometimes.
 
  You would typically be generalizing across users not examining them in
  isolation. Models are built on thousands or millions of data points.
 
  I assumed you were subsetting for cross-validation, in which case we are
  talking about making more like, say, 10 models. You usually take random
  subsets. But it might be just as fine to subset as a function of a user ID if
  you like. Or maybe you do have some reason for segregating users and modeling
  them differently (e.g. different geographies or something).
 
  Your code doesn't work as is since you are using RDDs inside RDDs. But I
  am also not sure you should do what it looks like you are trying to do.
 
  On Jan 13, 2015 12:32 AM, Josh Buffum jbuf...@gmail.com wrote:
 
  Sean,
 
   Thanks for the response. Is there some subtle difference between one model
   partitioned by N users and N models, one per user? I think I'm missing
   something in your question.
 
   Looping through the RDD filtering one user at a time would certainly give
   me the response that I am hoping for (i.e. a map of user => decision tree);
   however, that seems like it would yield poor performance? The user IDs are
   not integers, so I either need to iterate through some in-memory array of
   them (which could be quite large) or have some distributed lookup table.
   Neither seems great.
 
   I tried the random split thing. I wonder if I did something wrong there,
   but some of the splits got RDDs with 0 tuples and some got RDDs with more
   than 1 tuple. I guess that's to be expected with some random distribution?
   However, that won't work for me since it breaks the one-tree-per-user thing.
   I guess I could randomly distribute user IDs and then do the
   scan-everything-and-filter step...
 
  How bad of an idea is it to do:
 
  data.groupByKey.map( kvp => {
    val (key, data) = kvp
    val tree = 

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread sourabh chaki
Hi Albert,
There is some discussion going on here:
http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-model-export-PMML-vs-MLLIB-serialization-tc20324.html#a20674
I am also looking for this solution. But it looks like until MLlib PMML export
is ready, there is no foolproof solution for exporting an MLlib-trained model
to a different system.
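
For what it's worth, here is a minimal sketch of the built-in route in newer
releases, assuming Spark 1.3+ where MatrixFactorizationModel gained save/load;
the ALS parameters and path are placeholders:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Train in the batch process and persist the model to shared storage.
def trainAndExport(sc: SparkContext, ratings: RDD[Rating], path: String): Unit = {
  // rank = 10, iterations = 10, lambda = 0.01, alpha = 1.0 are placeholders
  val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)
  model.save(sc, path)  // writes metadata plus the user/product factor RDDs
}

// Load it back in the serving process.
def loadModel(sc: SparkContext, path: String): MatrixFactorizationModel =
  MatrixFactorizationModel.load(sc, path)

The load side still needs a SparkContext, so this is not a Spark-free export;
for serving outside Spark entirely, the PMML/export discussion linked above is
still the relevant thread.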

Thanks
Sourabh

On Mon, Dec 15, 2014 at 10:39 PM, Albert Manyà alber...@eml.cc wrote:

 In that case, what is the strategy to train a model in some background
 batch process and make recommendations for some other service in real
 time? Run both processes in the same spark cluster?

 Thanks.

 --
   Albert Manyà
   alber...@eml.cc

 On Mon, Dec 15, 2014, at 05:58 PM, Sean Owen wrote:
  This class is not going to be serializable, as it contains huge RDDs.
  Even if the right constructor existed the RDDs inside would not
  serialize.
 
  On Mon, Dec 15, 2014 at 4:33 PM, Albert Manyà alber...@eml.cc wrote:
   Hi all.
  
    I want to serialize and later load a model trained using MLlib's
    ALS.
  
    I've tried using Java serialization with something like:
  
    val model = ALS.trainImplicit(training, rank, numIter, lambda, 1)
    val fos = new FileOutputStream("model.bin")
    val oos = new ObjectOutputStream(fos)
    oos.writeObject(bestModel.get)
  
   But when I try to deserialize it using:
  
    val fos = new FileInputStream("model.bin")
    val oos = new ObjectInputStream(fos)
    val model = oos.readObject().asInstanceOf[MatrixFactorizationModel]
  
I get the error:
  
    Exception in thread "main" java.io.IOException: PARSING_ERROR(2)
  
    I've also tried to serialize both of MatrixFactorizationModel's RDDs
    (products and users) and later create the MatrixFactorizationModel by
    hand, passing the RDDs to the constructor, but I get an error because it's
    private:
  
   Error:(58, 17) constructor MatrixFactorizationModel in class
   MatrixFactorizationModel cannot be accessed in object RecommendALS
   val model = new MatrixFactorizationModel (8, userFeatures,
   productFeatures)
  
   Any ideas?
  
   Thanks!
  
   --
 Albert Manyà
 alber...@eml.cc
  