Re: spark broadcast unavailable

2014-12-10 Thread 十六夜涙
Hi All,
I've read the official docs of Tachyon, and it doesn't seem to fit my usage. As I 
understand it, it just caches files in memory, but I have a file containing over a 
million lines, about 70 MB; retrieving the data and mapping it to a Map variable 
costs several minutes, and I don't want to repeat that work in every map function. 
Tachyon also raises another problem: it throws an exception while doing 
./bin/tachyon format
The exception:
Exception in thread "main" java.lang.RuntimeException: 
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate 
with client version 4

It seems there's a compatibility problem with Hadoop, but even if that were solved, 
the efficiency issue described above would remain.
Could somebody tell me how to persist the data in memory? For now I just broadcast 
it, and re-submit the spark application whenever the broadcast value becomes 
unavailable.






-- Original Message --
From: "Akhil Das";;
Sent: Tuesday, December 9, 2014, 3:42 PM
To: "十六夜涙"; 
Cc: "user"; 
Subject: Re: spark broadcast unavailable



You cannot pass the sc object (val b = Utils.load(sc,ip_lib_path)) inside a map 
function, and that's why the Serialization exception is popping up (since sc is 
not serializable). You can try tachyon's cache if you want to persist the data 
in memory kind of forever.


Thanks
Best Regards



 
On Tue, Dec 9, 2014 at 12:12 PM, 十六夜涙  wrote:
Hi all,
In my spark application, I load a csv file and map the data to a Map variable for 
later use on the driver node, then broadcast it. Everything works fine until the 
exception java.io.FileNotFoundException occurs; the console log information shows 
me the broadcast is unavailable. I googled this problem: Spark will clean up the 
broadcast. There is a solution whose author mentions re-broadcasting, so I followed 
that way and wrote some exception-handling code with `try`, `catch`. After compiling 
and submitting the jar, I faced another problem: it shows "task not serializable".
so here I have three options:
1, get the right way of persisting the broadcast
2, solve the "task not serializable" problem when re-broadcasting the variable
3, save the data to some kind of database, although I prefer to keep the data in memory.


here are some code snippets:

  val esRdd = kafkaDStreams.flatMap(_.split("\\n"))
    .map {
      case esregex(datetime, time_request) =>
        var ipInfo: Array[String] = Array.empty
        try {
          ipInfo = Utils.getIpInfo(client_ip, b.value)
        } catch {
          case e: java.io.FileNotFoundException =>
            val b = Utils.load(sc, ip_lib_path)
            ipInfo = Utils.getIpInfo(client_ip, b.value)
        }
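
For reference, a minimal sketch of an alternative to re-broadcasting: load the 
lookup table lazily on each executor, so it lives for the executor's lifetime and 
cannot be cleaned up like a broadcast. This assumes the CSV is readable at the same 
path on every executor (e.g. a shared filesystem); it is not the code from this thread.

  // Per-JVM lazy singleton: built once per executor, reused by every task.
  object IpLookup {
    lazy val ipMap: Map[String, Array[String]] =
      scala.io.Source.fromFile("/shared/path/ip_lib.csv")   // path is an assumption
        .getLines()
        .map { line => val cols = line.split(","); cols(0) -> cols }
        .toMap
  }

  // Inside the DStream transformation no SparkContext or broadcast is needed:
  //   val ipInfo = IpLookup.ipMap.getOrElse(client_ip, Array.empty[String])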
‍

Re: Mllib error

2014-12-10 Thread Ritesh Kumar Singh
How did you build your spark 1.1.1 ?

On Wed, Dec 10, 2014 at 10:41 AM, amin mohebbi 
wrote:

> I'm trying to build a very simple scala standalone app using the Mllib,
> but I get the following error when trying to build the program:
>
> Object mllib is not a member of package org.apache.spark
>
>
> please note I just migrated from 1.0.2 to 1.1.1
>
>
>
> Best Regards
>
> ...
>
> Amin Mohebbi
>
> PhD candidate in Software Engineering
>  at university of Malaysia
>
> Tel : +60 18 2040 017
>
>
>
> E-Mail : tp025...@ex.apiit.edu.my
>
>   amin_...@me.com
>
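
For reference, the usual cause of "Object mllib is not a member of package 
org.apache.spark" is that only spark-core is on the compile classpath. A minimal 
build.sbt sketch (the sbt layout, Scala version and "provided" scope are assumptions 
about the setup, not details taken from this thread):

  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"  % "1.1.1" % "provided",
    "org.apache.spark" %% "spark-mllib" % "1.1.1"
  )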


Re: KafkaUtils explicit acks

2014-12-10 Thread Mukesh Jha
Hello Guys,

Any insights on this??
If I'm not clear enough, my question is: how can I use the kafka consumer and not
lose any data in case of failures with spark-streaming?

On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha  wrote:

> Hello Experts,
>
> I'm working on a spark app which reads data from kafka & persists it in
> hbase.
>
> Spark documentation states the below *[1]* that in case of worker failure
> we can lose some data. If not, how can I make my kafka stream more reliable?
> I have seen there is a simple consumer *[2]* but I'm not sure if it has
> been used/tested extensively.
>
> I was wondering if there is a way to explicitly acknowledge the kafka
> offsets once they are replicated in memory of other worker nodes (if it's
> not already done) to tackle this issue.
>
> Any help is appreciated in advance.
>
>
>1. *Using any input source that receives data through a network* - For
>network-based data sources like *Kafka *and Flume, the received input
>data is replicated in memory between nodes of the cluster (default
>replication factor is 2). So if a worker node fails, then the system can
>recompute the lost data from the left over copy of the input data. However,
>if the *worker node where a network receiver was running fails, then a
>tiny bit of data may be lost*, that is, the data received by the
>system but not yet replicated to other node(s). The receiver will be
>started on a different node and it will continue to receive data.
>2. https://github.com/dibbhatt/kafka-spark-consumer
>
> Txz,
>
> *Mukesh Jha *
>



-- 


Thanks & Regards,

*Mukesh Jha *


Re: Mllib native netlib-java/OpenBLAS

2014-12-10 Thread Guillaume Pitel

Hi,

I had the same problem, and tried to compile with mvn -Pnetlib-lgpl

$ mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean 
package


Unfortunately, the resulting assembly jar still lacked the netlib-system class. 
This command :


$ jar tvf assembly/target/scala-2.10/spark-assembly-1.1.1-hadoop2.3.0.jar |grep 
netlib | grep Native


returns nothing...

(and for some reason, including the netlib-all in my shipped jar did not solve 
the problem either, apparently the classloader does not find the class)


In Spark, the profile is defined in the mllib submodule, but -Pnetlib-lgpl does not 
seem to be propagated from the parent pom.xml to the child.


I don't know how to fix that cleanly (I just forced the netlib-lgpl profile's 
activation to true in mllib's pom.xml); maybe it's just a 
problem with my maven version (3.0.5).


Guillaume


I tried building Spark from the source, by downloading it and running:

mvn -Pnetlib-lgpl -DskipTests clean package



--
eXenSa


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. 
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



Actor System Corrupted!

2014-12-10 Thread Stephen Samuel (Sam)
Hi all,

Having a strange issue that I can't find any previous issues for on the
mailing list or stack overflow.

Frequently we are getting "ACTOR SYSTEM CORRUPTED!! A Dispatcher can't have
less than 0 inhabitants!" with a stack trace, from akka, in the executor
logs, and the executor is marked as EXITED. This happens more than it
doesn't and renders our spark process unusable.

We're stumped on what this could be. The only thing I can think of is that
we have another actor system in the same process, that perhaps is
conflicting? (I can't see any documentation that says we can't do this
though).

Any thoughts anyone?


MLLib in Production

2014-12-10 Thread Klausen Schaefersinho
Hi,


I would like to use Spark to train a model, but use the model in some other
place, e.g. a servlet, to do some classification in real time.

What is the best way to do this? Can I just copy a model file or something
and load it in the servlet? Can anybody point me to a good tutorial?


Cheers,


Klaus



-- 
“Overfitting” is not about an excessive amount of physical exercise...


Re: MLLib in Production

2014-12-10 Thread Simon Chan
Hi Klaus,

PredictionIO is an open source product based on Spark MLlib for exactly
this purpose.
This is the tutorial for classification in particular:
http://docs.prediction.io/classification/quickstart/

You can add custom serving logic and retrieve prediction results through the
REST API/SDKs at other places.

Simon

On Wed, Dec 10, 2014 at 2:25 AM, Klausen Schaefersinho <
klaus.schaef...@gmail.com> wrote:

> Hi,
>
>
> I would like to use Spark to train a model, but use the model in some
> other place, e.g. a servlet, to do some classification in real time.
>
> What is the best way to do this? Can I just copy a model file or something
> and load it in the servlet? Can anybody point me to a good tutorial?
>
>
> Cheers,
>
>
> Klaus
>
>
>
> --
> “Overfitting” is not about an excessive amount of physical exercise...
>


Re: KafkaUtils explicit acks

2014-12-10 Thread francois . garillot
Hi Mukesh,




There’s been some great work on Spark Streaming reliability lately




I’m not aware of any doc yet (did I miss something ?) but you can look at the 
ReliableKafkaReceiver’s test suite:





—
FG

On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha 
wrote:

> Hello Guys,
> Any insights on this??
> If I'm not clear enough, my question is: how can I use the kafka consumer and not
> lose any data in case of failures with spark-streaming.
> On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha  wrote:
>> Hello Experts,
>>
>> I'm working on a spark app which reads data from kafka & persists it in
>> hbase.
>>
>> Spark documentation states the below *[1]* that in case of worker failure
>> we can lose some data. If not, how can I make my kafka stream more reliable?
>> I have seen there is a simple consumer *[2]* but I'm not sure if it has
>> been used/tested extensively.
>>
>> I was wondering if there is a way to explicitly acknowledge the kafka
>> offsets once they are replicated in memory of other worker nodes (if it's
>> not already done) to tackle this issue.
>>
>> Any help is appreciated in advance.
>>
>>
>>1. *Using any input source that receives data through a network* - For
>>network-based data sources like *Kafka *and Flume, the received input
>>data is replicated in memory between nodes of the cluster (default
>>replication factor is 2). So if a worker node fails, then the system can
>>recompute the lost data from the left over copy of the input data. However,
>>if the *worker node where a network receiver was running fails, then a
>>tiny bit of data may be lost*, that is, the data received by the
>>system but not yet replicated to other node(s). The receiver will be
>>started on a different node and it will continue to receive data.
>>2. https://github.com/dibbhatt/kafka-spark-consumer
>>
>> Txz,
>>
>> *Mukesh Jha *
>>
> -- 
> Thanks & Regards,
> *Mukesh Jha *

Re: KafkaUtils explicit acks

2014-12-10 Thread francois . garillot
[sorry for the botched half-message]




Hi Mukesh,




There’s been some great work on Spark Streaming reliability lately.

https://www.youtube.com/watch?v=jcJq3ZalXD8


Look at the links from:

https://issues.apache.org/jira/browse/SPARK-3129








I’m not aware of any doc yet (did I miss something ?) but you can look at the 
ReliableKafkaReceiver’s test suite:






external/kafka/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala


—
FG

On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha 
wrote:

> Hello Guys,
> Any insights on this??
> If I'm not clear enough, my question is: how can I use the kafka consumer and not
> lose any data in case of failures with spark-streaming.
> On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha  wrote:
>> Hello Experts,
>>
>> I'm working on a spark app which reads data from kafka & persists it in
>> hbase.
>>
>> Spark documentation states the below *[1]* that in case of worker failure
>> we can lose some data. If not, how can I make my kafka stream more reliable?
>> I have seen there is a simple consumer *[2]* but I'm not sure if it has
>> been used/tested extensively.
>>
>> I was wondering if there is a way to explicitly acknowledge the kafka
>> offsets once they are replicated in memory of other worker nodes (if it's
>> not already done) to tackle this issue.
>>
>> Any help is appreciated in advance.
>>
>>
>>1. *Using any input source that receives data through a network* - For
>>network-based data sources like *Kafka *and Flume, the received input
>>data is replicated in memory between nodes of the cluster (default
>>replication factor is 2). So if a worker node fails, then the system can
>>recompute the lost data from the left over copy of the input data. However,
>>if the *worker node where a network receiver was running fails, then a
>>tiny bit of data may be lost*, that is, the data received by the
>>system but not yet replicated to other node(s). The receiver will be
>>started on a different node and it will continue to receive data.
>>2. https://github.com/dibbhatt/kafka-spark-consumer
>>
>> Txz,
>>
>> *Mukesh Jha *
>>
> -- 
> Thanks & Regards,
> *Mukesh Jha *
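
For reference, a minimal sketch of the write-ahead-log path that SPARK-3129 (linked 
above) adds in Spark 1.2: received Kafka blocks are written to a fault-tolerant log 
before being processed, so a receiver failure does not lose them. The property name 
follows the 1.2 work; the checkpoint directory, Kafka parameters and batch interval 
are placeholders.

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf()
    .setAppName("kafka-to-hbase")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true") // reliable receiver path
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/kafka-app") // the WAL needs a fault-tolerant directory

  // Serialized storage; the WAL already provides durability, so the replicated
  // _2 level is not strictly needed.
  val stream = KafkaUtils.createStream(ssc, "zk1:2181", "consumer-group",
    Map("events" -> 2), StorageLevel.MEMORY_AND_DISK_SER)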

Maven profile in MLLib netlib-lgpl not working (1.1.1)

2014-12-10 Thread Guillaume Pitel

Hi

Issue created https://issues.apache.org/jira/browse/SPARK-4816

Probably a maven-related question for profiles in child modules

I couldn't find a clean solution, just a workaround: modify the pom.xml in the 
mllib module to force activation of the netlib-lgpl profile.


Hope a maven expert will help.

Guillaume

+1 with 1.3-SNAPSHOT.

On Mon, Dec 1, 2014 at 5:49 PM, agg212 > wrote:


Thanks for your reply, but I'm still running into issues
installing/configuring the native libraries for MLlib.  Here are
the steps
I've taken, please let me know if anything is incorrect.

- Download Spark source
- unzip and compile using `mvn -Pnetlib-lgpl -DskipTests clean
package `
- Run `sbt/sbt publish-local`

The last step fails with the following error (full stack trace is
attached
here:  error.txt

):
[error] (sql/compile:compile) java.lang.AssertionError: assertion
failed:
List(object package$DebugNode, object package$DebugNode)

Do I still have to install OPENBLAS/anything else if I build Spark
from the
source using the -Pnetlib-lgpl flag?  Also, do I change the Spark
version
(from 1.1.0 to 1.2.0-SNAPSHOT) in the .sbt file for my app?

Thanks!



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662p20110.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org






--
eXenSa


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. 
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



Re: MLLib in Production

2014-12-10 Thread Yanbo Liang
Hi Klaus,

There is no ideal method, but there are some workarounds.
Train the model in a Spark or YARN cluster, then use RDD.saveAsTextFile
to store the model, i.e. its weights and intercept, to HDFS.
Load the weights and intercept files from HDFS, construct a GLM model, and
then run the model.predict() method to get what you want.

The Spark community also has some ongoing work on exporting models with
PMML.
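
A minimal sketch of that workaround for a logistic-regression model (the paths, the 
one-line text format and the featureArray input are illustrative assumptions, not an 
established API):

  import org.apache.spark.mllib.classification.LogisticRegressionModel
  import org.apache.spark.mllib.linalg.Vectors

  // Training side: persist weights and intercept as a single line of text on HDFS.
  val line = model.weights.toArray.mkString(",") + ";" + model.intercept
  sc.parallelize(Seq(line), 1).saveAsTextFile("hdfs:///models/lr")

  // Serving side: rebuild the GLM from the saved numbers and call predict().
  val Array(w, b) = sc.textFile("hdfs:///models/lr").first().split(";")
  val servedModel =
    new LogisticRegressionModel(Vectors.dense(w.split(",").map(_.toDouble)), b.toDouble)
  val prediction = servedModel.predict(Vectors.dense(featureArray)) // featureArray: Array[Double]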

2014-12-10 18:32 GMT+08:00 Simon Chan :

> Hi Klaus,
>
> PredictionIO is an open source product based on Spark MLlib for exactly
> this purpose.
> This is the tutorial for classification in particular:
> http://docs.prediction.io/classification/quickstart/
>
> You can add custom serving logics and retrieve prediction result through
> REST API/SDKs at other places.
>
> Simon
>
>
> On Wed, Dec 10, 2014 at 2:25 AM, Klausen Schaefersinho <
> klaus.schaef...@gmail.com> wrote:
>
>> Hi,
>>
>>
>> I would like to use Spark to train a model, but use the model in some
>> other place, e.g. a servlet, to do some classification in real time.
>>
>> What is the best way to do this? Can I just copy a model file or
>> something and load it in the servlet? Can anybody point me to a good
>> tutorial?
>>
>>
>> Cheers,
>>
>>
>> Klaus
>>
>>
>>
>> --
>> “Overfitting” is not about an excessive amount of physical exercise...
>>
>
>


flatMap and spilling of output to disk

2014-12-10 Thread Johannes Simon
Hi!

I have been using spark a lot recently and it's been running really well and 
fast, but now when I increase the data size, it's starting to run into problems:
I have an RDD in the form of (String, Iterable[String]) - the Iterable[String] 
was produced by a groupByKey() - and I perform a flatMap on it that outputs 
some form of cartesian product of the values per key:


rdd.flatMap({case (key, values) => for(v1 <- values; v2 <- values) yield ((v1, 
v2), 1)})


So the runtime cost per RDD entry is O(n^2) where n is the number of values. 
This n can sometimes be 10,000 or even 100,000. That produces a lot of data, I 
am aware of that, but otherwise I wouldn't need a cluster, would I? :) For 
n<=1000 this operation works quite well. But as soon as I allow n to be 
10,000 or higher, I start to get "GC overhead limit exceeded" exceptions.

Configuration:
- 7g executor memory
- spark.shuffle.memoryFraction=0.5
- spark.storage.memoryFraction=0.1
I am not sure how the remaining memory for the actual JVM instance performing 
the flatMap is computed, but I would assume it to be something like 
(1-0.5-0.1)*7g = 2.8g

Now I am wondering: Why should these 2.8g (or say even a few hundred MB) not 
suffice for spark to process this flatMap without too much GC overhead? If I 
assume a string to be 10 characters on average, therefore consuming about 60 
bytes with overhead taken into account, then 10,000 of these values sum up to 
no more than ~600kb, and apart from that spark never has to keep anything else 
in memory.

My question: When does spark start to spill RDD entries to disk, assuming that 
no RDD is to be persisted? Does it keep all output of the flatMap operation in 
memory until the entire flatMap is done? Or does it already spill every single 
yielded "((v1, v2), 1)" entry out to disk if necessary?

Thanks a lot!
Johannes
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: flatMap and spilling of output to disk

2014-12-10 Thread Sean Owen
You are rightly thinking that Spark should be able to just "stream"
this massive collection of pairs you are creating, and never need to
put it all in memory. That's true, but, your function actually creates
a huge collection of pairs in memory before Spark ever touches it.

This is going to materialize all pairs in a collection:

for(v1 <- values; v2 <- values) yield ((v1, v2), 1)

In fact, you only need to return a TraversableOnce to Spark. An
Iterator is just fine, for example.

Others probably have a better suggestion, but, I think you can try
something along these lines instead:

values.toIterator.flatMap(v1 => values.toIterator.map(v2 => ((v1, v2), 1)))

That should mean you hold at most N tuples in memory (depends on how
flatMap works I guess), not N^2.

That's the problem and I think there is a solution along these lines,
but I haven't tested it or thought about it for more than 2 minutes.

After that you should allow other behaviors from Spark to help you.
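
A minimal sketch along those lines (assuming rdd: RDD[(String, Iterable[String])] as 
in the original post); the inner iterator is re-created for each v1, so the full n^2 
collection is never materialized at once:

  val pairs = rdd.flatMap { case (key, values) =>
    values.toIterator.flatMap(v1 => values.toIterator.map(v2 => ((v1, v2), 1)))
  }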

On Wed, Dec 10, 2014 at 12:13 PM, Johannes Simon  wrote:
> Hi!
>
> I have been using spark a lot recently and it's been running really well and 
> fast, but now when I increase the data size, it's starting to run into 
> problems:
> I have an RDD in the form of (String, Iterable[String]) - the 
> Iterable[String] was produced by a groupByKey() - and I perform a flatMap on 
> it that outputs some form of cartesian product of the values per key:
>
>
> rdd.flatMap({case (key, values) => for(v1 <- values; v2 <- values) yield 
> ((v1, v2), 1)})
>
>
> So the runtime cost per RDD entry is O(n^2) where n is the number of values. 
> This n can sometimes be 10,000 or even 100,000. That produces a lot of data, 
> I am aware of that, but otherwise I wouldn't need a cluster, would I? :) For 
> n<=1000 this operation works quite well. But as soon as I allow n to be <= 
> 10,000 or higher, I start to get "GC overhead limit exceeded" exceptions.
>
> Configuration:
> - 7g executor memory
> - spark.shuffle.memoryFraction=0.5
> - spark.storage.memoryFraction=0.1
> I am not sure how the remaining memory for the actual JVM instance performing 
> the flatMap is computed, but I would assume it to be something like 
> (1-0.5-0.1)*7g = 2.8g
>
> Now I am wondering: Why should these 2.8g (or say even a few hundred MB) not 
> suffice for spark to process this flatMap without too much GC overhead? If I 
> assume a string to be 10 characters on average, therefore consuming about 60 
> bytes with overhead taken into account, then 10,000 of these values sum up to 
> no more than ~600kb, and apart from that spark never has to keep anything 
> else in memory.
>
> My question: When does spark start to spill RDD entries to disk, assuming 
> that no RDD is to be persisted? Does it keep all output of the flatMap 
> operation in memory until the entire flatMap is done? Or does it already 
> spill every single yielded "((v1, v2), 1)" entry out to disk if necessary?
>
> Thanks a lot!
> Johannes
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: flatMap and spilling of output to disk

2014-12-10 Thread Shixiong Zhu
for(v1 <- values; v2 <- values) yield ((v1, v2), 1) will generate all data
at once and return all of them to flatMap.

To solve your problem, you should use for (v1 <- values.iterator; v2 <-
values.iterator) yield ((v1, v2), 1) which will generate the data when it’s
necessary.
​

Best Regards,
Shixiong Zhu

2014-12-10 20:13 GMT+08:00 Johannes Simon :

> Hi!
>
> I have been using spark a lot recently and it's been running really well
> and fast, but now when I increase the data size, it's starting to run into
> problems:
> I have an RDD in the form of (String, Iterable[String]) - the
> Iterable[String] was produced by a groupByKey() - and I perform a flatMap
> on it that outputs some form of cartesian product of the values per key:
>
>
> rdd.flatMap({case (key, values) => for(v1 <- values; v2 <- values) yield
> ((v1, v2), 1)})
>
>
> So the runtime cost per RDD entry is O(n^2) where n is the number of
> values. This n can sometimes be 10,000 or even 100,000. That produces a lot
> of data, I am aware of that, but otherwise I wouldn't need a cluster, would
> I? :) For n<=1000 this operation works quite well. But as soon as I allow n
> to be <= 10,000 or higher, I start to get "GC overhead limit exceeded"
> exceptions.
>
> Configuration:
> - 7g executor memory
> - spark.shuffle.memoryFraction=0.5
> - spark.storage.memoryFraction=0.1
> I am not sure how the remaining memory for the actual JVM instance
> performing the flatMap is computed, but I would assume it to be something
> like (1-0.5-0.1)*7g = 2.8g
>
> Now I am wondering: Why should these 2.8g (or say even a few hundred MB)
> not suffice for spark to process this flatMap without too much GC overhead?
> If I assume a string to be 10 characters on average, therefore consuming
> about 60 bytes with overhead taken into account, then 10,000 of these
> values sum up to no more than ~600kb, and apart from that spark never has
> to keep anything else in memory.
>
> My question: When does spark start to spill RDD entries to disk, assuming
> that no RDD is to be persisted? Does it keep all output of the flatMap
> operation in memory until the entire flatMap is done? Or does it already
> spill every single yielded "((v1, v2), 1)" entry out to disk if necessary?
>
> Thanks a lot!
> Johannes
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: MLLib in Production

2014-12-10 Thread Sonal Goyal
You can also serialize the model and use it in other places.

Best Regards,
Sonal
Founder, Nube Technologies 





On Wed, Dec 10, 2014 at 5:32 PM, Yanbo Liang  wrote:

> Hi Klaus,
>
> There is no ideal method but some workaround.
> Train model in Spark cluster or YARN cluster, then use RDD.saveAsTextFile
> to store this model which include weights and intercept to HDFS.
> Load weights file and intercept file from HDFS, construct a GLM model, and
> then run model.predict() method, you can get what you want.
>
> The Spark community also have some ongoing work about export model with
> PMML.
>
> 2014-12-10 18:32 GMT+08:00 Simon Chan :
>
>> Hi Klaus,
>>
>> PredictionIO is an open source product based on Spark MLlib for exactly
>> this purpose.
>> This is the tutorial for classification in particular:
>> http://docs.prediction.io/classification/quickstart/
>>
>> You can add custom serving logics and retrieve prediction result through
>> REST API/SDKs at other places.
>>
>> Simon
>>
>>
>> On Wed, Dec 10, 2014 at 2:25 AM, Klausen Schaefersinho <
>> klaus.schaef...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> I would like to use Spark to train a model, but use the model in some
>>> other place, e.g. a servlet, to do some classification in real time.
>>>
>>> What is the best way to do this? Can I just copy a model file or
>>> something and load it in the servlet? Can anybody point me to a good
>>> tutorial?
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Klaus
>>>
>>>
>>>
>>> --
>>> “Overfitting” is not about an excessive amount of physical exercise...
>>>
>>
>>
>


Re: flatMap and spilling of output to disk

2014-12-10 Thread Johannes Simon
Hi!

Using an iterator solved the problem! I've been chewing on this for days, so 
thanks a lot to both of you!! :)

Since in an earlier version of my code, I used a self-join to perform the same 
thing, and ran into the same problems, I just looked at the implementation of 
PairRDDFunction.join (Spark v1.1.1):

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1; w <- pair._2) yield (v, w)
  )
}

Is there a reason to not use an iterator here if possible? Pardon my lack of 
Scala knowledge.. This should in any case cause the same problems I had when 
the size of vs/ws gets too large. (Though that question is more of a dev ml 
question)

Thanks!
Johannes

> Am 10.12.2014 um 13:44 schrieb Shixiong Zhu :
> 
> for(v1 <- values; v2 <- values) yield ((v1, v2), 1) will generate all data at 
> once and return all of them to flatMap.
> 
> To solve your problem, you should use for (v1 <- values.iterator; v2 <- 
> values.iterator) yield ((v1, v2), 1) which will generate the data when it’s 
> necessary.
> 
> 
> Best Regards,
> 
> Shixiong Zhu
> 
> 2014-12-10 20:13 GMT+08:00 Johannes Simon  >:
> Hi!
> 
> I have been using spark a lot recently and it's been running really well and 
> fast, but now when I increase the data size, it's starting to run into 
> problems:
> I have an RDD in the form of (String, Iterable[String]) - the 
> Iterable[String] was produced by a groupByKey() - and I perform a flatMap on 
> it that outputs some form of cartesian product of the values per key:
> 
> 
> rdd.flatMap({case (key, values) => for(v1 <- values; v2 <- values) yield 
> ((v1, v2), 1)})
> 
> 
> So the runtime cost per RDD entry is O(n^2) where n is the number of values. 
> This n can sometimes be 10,000 or even 100,000. That produces a lot of data, 
> I am aware of that, but otherwise I wouldn't need a cluster, would I? :) For 
> n<=1000 this operation works quite well. But as soon as I allow n to be <= 
> 10,000 or higher, I start to get "GC overhead limit exceeded" exceptions.
> 
> Configuration:
> - 7g executor memory
> - spark.shuffle.memoryFraction=0.5
> - spark.storage.memoryFraction=0.1
> I am not sure how the remaining memory for the actual JVM instance performing 
> the flatMap is computed, but I would assume it to be something like 
> (1-0.5-0.1)*7g = 2.8g
> 
> Now I am wondering: Why should these 2.8g (or say even a few hundred MB) not 
> suffice for spark to process this flatMap without too much GC overhead? If I 
> assume a string to be 10 characters on average, therefore consuming about 60 
> bytes with overhead taken into account, then 10,000 of these values sum up to 
> no more than ~600kb, and apart from that spark never has to keep anything 
> else in memory.
> 
> My question: When does spark start to spill RDD entries to disk, assuming 
> that no RDD is to be persisted? Does it keep all output of the flatMap 
> operation in memory until the entire flatMap is done? Or does it already 
> spill every single yielded "((v1, v2), 1)" entry out to disk if necessary?
> 
> Thanks a lot!
> Johannes
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Jaonary Rabarisoa
Dear all,

I'm trying to understand what is the correct use case of ColumnSimilarity
implemented in RowMatrix.

As far as I know, this function computes the similarities between the columns of a
given matrix. The DIMSUM paper says that it's efficient for large m (rows)
and small n (columns). In this case the output will be a n by n matrix.

Now, suppose I want to compute the similarity of several users, say m =
billions. Each user is described by a high dimensional feature vector, say
n = 1. In my dataset, one row represents one user. So in that case
computing the similarity of my matrix is not the same as computing the
similarity of all users. What, then, does computing the similarity of
the columns of my matrix mean in this case?

Best regards,

Jao


Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Sean Owen
Well, you're computing similarity of your features then. Whether it is
meaningful depends a bit on the nature of your features and more on
the similarity algorithm.

On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa  wrote:
> Dear all,
>
> I'm trying to understand what is the correct use case of ColumnSimilarity
> implemented in RowMatrix.
>
> As far as I know, this function computes the similarity of a column of a
> given matrix. The DIMSUM paper says that it's efficient for large m (rows)
> and small n (columns). In this case the output will be a n by n matrix.
>
> Now, suppose I want to compute similarity of several users, say m =
> billions. Each users is described by a high dimensional feature vector, say
> n = 1. In my dataset, one row represent one user. So in that case
> computing the similarity my matrix is not the same as computing the
> similarity of all users. Then, what does it mean computing the similarity of
> the columns of my matrix in this case ?
>
> Best regards,
>
> Jao

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



KryoException: Buffer overflow for very small input

2014-12-10 Thread JoeWass
I have narrowed down my problem to some code plus an input file with a single
very small input (one line). I'm getting a
"com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 14634430", but as the input is so small I think there's something
else up. I'm not sure what. Can anyone help?

I'm using the Flambo Clojure wrapper (but I don't think it makes much
difference) and spark-core_2.10 "1.1.1". 

Here's the program that crashes. It's in Clojure but it's very
straightforward as I've narrowed it down to the minimum crashing example. If I
remove one (any) line it works. I understand that for each of the `map`s the
file will be re-read (all one line of it). 

I could just set `spark.kryoserializer.buffer.mb` to a large number, but I
don't think the default should break with such a small input (each object
I'm dealing with should weigh in at no more than 100 bytes), so I want to
understand what's wrong before tweaking the value, which I understood to be the
maximum single object size. 

Thanks for your help in advance.
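
For reference, a minimal sketch of raising that buffer once the root cause is 
understood (the value is a placeholder; spark.kryoserializer.buffer.mb is the 
property name used by the 1.1.x line):

  // Per-application Kryo settings; the buffer size is in megabytes.
  val conf = new org.apache.spark.SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryoserializer.buffer.mb", "64")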





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KryoException-Buffer-overflow-for-very-small-input-tp20606.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



how to run spark function in a tomcat servlet

2014-12-10 Thread bai阿蒙
Hi guys:
I want to call the RDD API in a servlet which runs in Tomcat, so I added 
spark-core.jar to the WEB-INF/lib of the web project and deployed it to Tomcat. 
But spark-core.jar contains the HttpServlet that belongs to Jetty, so there is a 
conflict. Can anybody tell me how to resolve it? Thanks a lot

baishuo
  

Re: how to run spark function in a tomcat servlet

2014-12-10 Thread Alonso Isidoro Roman
Hi, I think this post

shall help you.

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2014-12-10 17:19 GMT+01:00 bai阿蒙 :

> Hi guys:
> I want to call the RDD api in a servlet which run in a tomcat. So I add
> the spark-core.jar to the web-inf/lib of the web project .And deploy it to
> tomcat.  but In spark-core.jar there exist the httpserlet which belongs to
> jetty. then there is some conflict. Can Anybody tell me how to resolve it?
> thanks a lot
>
> baishuo
>


Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Debasish Das
If you have tall x skinny matrix of m users and n products, column
similarity will give you a n x n matrix (product x product matrix)...this
is also called product correlation matrix...it can be cosine, pearson or
other kind of correlations...Note that if the entry is unobserved (user
Joanary did not rate movie Top Gun) , column similarities will consider it
as implicit 0...

If you want similar users you want to generate a m x m matrix and you are
going towards kernel matrix...The general problem is to take a m x n matrix
that has n features and increase it to m features where m > ncosine for
linear kernel and RBF for non-linear kernel...

The dimsum/column-similarity map-reduce is not optimized for kernel matrix
generation...you need to look into map-reduce kernel matrix
generation...this kernel matrix can then help you answer similar users,
spectral clustering and kernel regression/classification/SVM if you have
labels...

A simplification of the problem is to take your m x n matrix and run
k-Means on it, which will produce clusters of users...now for each user you
can compute the closest users in its cluster...that drops the complexity from
O(m*m) down to O(m*c), where c is the max number of users in each cluster...


On Wed, Dec 10, 2014 at 7:39 AM, Sean Owen  wrote:

> Well, you're computing similarity of your features then. Whether it is
> meaningful depends a bit on the nature of your features and more on
> the similarity algorithm.
>
> On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa 
> wrote:
> > Dear all,
> >
> > I'm trying to understand what is the correct use case of ColumnSimilarity
> > implemented in RowMatrix.
> >
> > As far as I know, this function computes the similarity of a column of a
> > given matrix. The DIMSUM paper says that it's efficient for large m
> (rows)
> > and small n (columns). In this case the output will be a n by n matrix.
> >
> > Now, suppose I want to compute similarity of several users, say m =
> > billions. Each users is described by a high dimensional feature vector,
> say
> > n = 1. In my dataset, one row represent one user. So in that case
> > computing the similarity my matrix is not the same as computing the
> > similarity of all users. Then, what does it mean computing the
> similarity of
> > the columns of my matrix in this case ?
> >
> > Best regards,
> >
> > Jao
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark broadcast unavailable

2014-12-10 Thread Haoyuan Li
Which Hadoop version are you using? Seems the exception you got was caused
by incompatible hadoop version.

Best,

Haoyuan

On Wed, Dec 10, 2014 at 12:30 AM, 十六夜涙  wrote:

> Hi All,
> I've read the official docs of Tachyon, and it doesn't seem to fit my usage. As I
> understand it, it just caches files in memory, but I have a file containing
> over a million lines, about 70 MB; retrieving the data and mapping it to a
> *Map* variable costs several minutes, and I don't want to repeat that work
> in every map function. Tachyon also raises another problem: it throws an
> exception while doing *./bin/tachyon format*
> The exception:
> Exception in thread "main" java.lang.RuntimeException:
> org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot
> communicate with client version 4
> It seems there's a compatibility problem with Hadoop, but even if that were
> solved, the efficiency issue described above would remain.
> Could somebody tell me how to persist the data in memory? For now I just
> broadcast it, and re-submit the spark application whenever the broadcast value
> becomes unavailable.
>
>
>
> -- Original Message --
> *From:* "Akhil Das";;
> *Sent:* Tuesday, December 9, 2014, 3:42 PM
> *To:* "十六夜涙";
> *Cc:* "user";
> *Subject:* Re: spark broadcast unavailable
>
> You cannot pass the sc object (*val b = Utils.load(sc,ip_lib_path)*)
> inside a map function and that's why the Serialization exception is popping
> up( since sc is not serializable). You can try tachyon's cache if you want
> to persist the data in memory kind of forever.
>
> Thanks
> Best Regards
>
> On Tue, Dec 9, 2014 at 12:12 PM, 十六夜涙  wrote:
>
>> Hi all,
>> In my spark application, I load a csv file and map the data to a Map
>> variable for later use on the driver node, then broadcast it. Everything works
>> fine until the exception java.io.FileNotFoundException occurs; the console
>> log information shows me the broadcast is unavailable. I googled this
>> problem: Spark will clean up the broadcast. There is a
>> solution whose author mentions re-broadcasting; I followed that
>> way and wrote some exception-handling code with `try`, `catch`. After compiling
>> and submitting the jar, I faced another problem: it shows "task
>> not serializable".
>> so here I have three options:
>> 1, get the right way of persisting the broadcast
>> 2, solve the "task not serializable" problem when re-broadcasting the variable
>> 3, save the data to some kind of database, although I prefer to keep the data in
>> memory.
>>
>> here is come code snippets:
>>   val esRdd = kafkaDStreams.flatMap(_.split("\\n"))
>>   .map{
>>   case esregex(datetime, time_request) =>
>> var ipInfo:Array[String]=Array.empty
>> try{
>> ipInfo = Utils.getIpInfo(client_ip,b.value)
>> }catch{
>>   case e:java.io.FileNotFoundException =>{
>> val b = Utils.load(sc,ip_lib_path)
>> ipInfo = Utils.getIpInfo(client_ip,b.value)
>>   }
>> }
>> ‍
>>
>
>


-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
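
"Server IPC version 9 cannot communicate with client version 4" is the classic 
symptom of a Hadoop 1.x client talking to a Hadoop 2.x HDFS, so the usual fix is to 
rebuild Tachyon against the cluster's Hadoop version before formatting again. A 
sketch (the exact version string is a placeholder; the hadoop.version property 
follows Tachyon's Maven build of that era):

  $ mvn -Dhadoop.version=2.4.0 -DskipTests clean package
  $ ./bin/tachyon format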


Spark 1.1.0 does not spawn more than 6 executors in yarn-client mode and ignores --num-executors

2014-12-10 Thread Aniket Bhatnagar
I am running spark 1.1.0 on AWS EMR and I am running a batch job that
seems to be highly parallelizable in yarn-client mode. But spark
stops spawning any more executors after spawning 6 executors, even though
YARN cluster has 15 healthy m1.large nodes. I even tried providing
'--num-executors 60' argument during spark-submit but even that doesn't
help. A quick look at spark admin UI suggests there are active stages whose
tasks have not been started yet and even then spark doesn't start more
executors. I am not sure why. Any help on this would be greatly appreciated.

Here is link to screen shots that I took of spark admin and yarn admin -
http://imgur.com/a/ztjr7

Thanks,
Aniket


Re: PhysicalRDD problem?

2014-12-10 Thread Michael Armbrust
I'm hesitant to merge that PR in as it is using a brand new configuration
path that is different from the way that the rest of Spark SQL / Spark are
configured.

I'm suspicious that that hitting max iterations is emblematic of some other
issue, as typically resolution happens bottom up, in a single pass.  Can
you provide more details about your schema / query?

On Tue, Dec 9, 2014 at 11:43 PM, Nitin Goyal  wrote:

> I see that somebody had already raised a PR for this but it hasn't been
> merged.
>
> https://issues.apache.org/jira/browse/SPARK-4339
>
> Can we merge this in next 1.2 RC?
>
> Thanks
> -Nitin
>
>
> On Wed, Dec 10, 2014 at 11:50 AM, Nitin Goyal 
> wrote:
>
>> Hi Michael,
>>
>> I think I have found the exact problem in my case. I see that we have
>> written something like following in Analyzer.scala :-
>>
>>   // TODO: pass this in as a parameter.
>>
>>   val fixedPoint = FixedPoint(100)
>>
>>
>> and
>>
>>
>> Batch("Resolution", fixedPoint,
>>
>>   ResolveReferences ::
>>
>>   ResolveRelations ::
>>
>>   ResolveSortReferences ::
>>
>>   NewRelationInstances ::
>>
>>   ImplicitGenerate ::
>>
>>   StarExpansion ::
>>
>>   ResolveFunctions ::
>>
>>   GlobalAggregates ::
>>
>>   UnresolvedHavingClauseAttributes ::
>>
>>   TrimGroupingAliases ::
>>
>>   typeCoercionRules ++
>>
>>   extendedRules : _*),
>>
>> Perhaps in my case, it reaches the 100 iterations and breaks out of the while
>> loop in RuleExecutor.scala and thus doesn't "resolve" all the attributes.
>>
>> Exception in my logs :-
>>
>> 14/12/10 04:45:28 INFO HiveContext$$anon$4: Max iterations (100) reached
>> for batch Resolution
>>
>> 14/12/10 04:45:28 ERROR [Sql]: Servlet.service() for servlet [Sql] in
>> context with path [] threw exception [Servlet execution threw an exception]
>> with root cause
>>
>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>> Unresolved attributes: 'T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
>> DOWN_BYTESHTTPSUBCR#6567, tree:
>>
>> 'Project ['T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
>> DOWN_BYTESHTTPSUBCR#6567]
>>
>> ...
>>
>> ...
>>
>> ...
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80)
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)
>>
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>>
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)
>>
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>>
>> at
>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>>
>> at
>> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>>
>> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>>
>> at scala.collection.immutable.List.foreach(List.scala:318)
>>
>> at
>> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>>
>> at
>> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
>>
>> at
>> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
>>
>> at
>> org.apache.spark.sql.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:86)
>>
>> at
>> org.apache.spark.sql.CacheManager$class.writeLock(CacheManager.scala:67)
>>
>> at
>> org.apache.spark.sql.CacheManager$class.cacheQuery(CacheManager.scala:85)
>>
>> at org.apache.spark.sql.SQLContext.cacheQuery(SQLContext.scala:50)
>>
>>  at org.apache.spark.sql.SchemaRDD.cache(SchemaRDD.scala:490)
>>
>>
>> I think the solution here is to have the FixedPoint constructor argument
>> as configurable/parameterized (also written as TODO). Do we have a plan to
>> do this in 1.2 release? Or I can take this up as a task for myself if you
>> want (since this is very crucial for our release).
>>
>>
>> Thanks
>>
>> -Nitin
>>
>> On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust > > wrote:
>>
>>> val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
 existingSchemaRDD.schema)

>>>
>>> This line is throwing away the logical information about
>>> existingSchemaRDD and thus Spark SQL can't know how to push down
>>> projections or predicates past this operator.
>>>
>>> Can you describe more the problems that you see if you don't do this
>>> re

Issue when upgrading from Spark 1.1.0 to 1.1.1: Exception of java.lang.NoClassDefFoundError: io/netty/util/TimerTask

2014-12-10 Thread S. Zhou
Everything worked fine on Spark 1.1.0 until we upgraded to 1.1.1. For some of 
our unit tests we saw the following exceptions. Any idea how to solve it? 
Thanks!
java.lang.NoClassDefFoundError: io/netty/util/TimerTask
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:72)
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:168)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:204)
        at spark.jobserver.util.DefaultSparkContextFactory.makeContext(SparkContextFactory.scala:34)
        at spark.jobserver.JobManagerActor.createContextFromConfig(JobManagerActor.scala:255)
        at spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:104)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        ...



Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Reza Zadeh
As Sean mentioned, you would be computing similar features then.

If you want to find similar users, I suggest running k-means with some
fixed number of clusters. It's not reasonable to try and compute all pairs
of similarities between 1bn items, so k-means with fixed k is more suitable
here.

Best,
Reza
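
For reference, a minimal sketch of that k-means route (userFeatures: RDD[Vector], 
one row per user, is an assumed input; k and the iteration count are placeholders):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  def clusterUsers(userFeatures: RDD[Vector]): RDD[(Int, Vector)] = {
    // Fit k-means, then tag every user with its cluster id; "similar users" are
    // then searched only inside the same cluster instead of across all pairs.
    val model = KMeans.train(userFeatures, 1000, 20)
    userFeatures.map(v => (model.predict(v), v))
  }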

On Wed, Dec 10, 2014 at 10:39 AM, Sean Owen  wrote:

> Well, you're computing similarity of your features then. Whether it is
> meaningful depends a bit on the nature of your features and more on
> the similarity algorithm.
>
> On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa 
> wrote:
> > Dear all,
> >
> > I'm trying to understand what is the correct use case of ColumnSimilarity
> > implemented in RowMatrix.
> >
> > As far as I know, this function computes the similarity of a column of a
> > given matrix. The DIMSUM paper says that it's efficient for large m
> (rows)
> > and small n (columns). In this case the output will be a n by n matrix.
> >
> > Now, suppose I want to compute similarity of several users, say m =
> > billions. Each users is described by a high dimensional feature vector,
> say
> > n = 1. In my dataset, one row represent one user. So in that case
> > computing the similarity my matrix is not the same as computing the
> > similarity of all users. Then, what does it mean computing the
> similarity of
> > the columns of my matrix in this case ?
> >
> > Best regards,
> >
> > Jao
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines ...

2014-12-10 Thread Manoj Samel
I am using SQLContext.jsonFile. If a valid JSON contains newlines,
spark1.1.1 dumps trace below. If the JSON is read as one line, it works
fine. Is this known?


14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID
28)

com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input
within/between OBJECT entries

 at [Source: java.io.StringReader@4c8dd4d9; line: 1, column: 19]

at com.fasterxml.jackson.core.JsonParser._constructError(
JsonParser.java:1524)

at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWS(
ReaderBasedJsonParser.java:1682)

at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(
ReaderBasedJsonParser.java:619)

at
com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(
MapDeserializer.java:412)

at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(
MapDeserializer.java:312)

at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(
MapDeserializer.java:26)

at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(
ObjectMapper.java:2986)

at com.fasterxml.jackson.databind.ObjectMapper.readValue(
ObjectMapper.java:2091)

at
org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(
JsonRDD.scala:275)

at
org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(
JsonRDD.scala:274)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at scala.collection.TraversableOnce$class.reduceLeft(
TraversableOnce.scala:172)

at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)

at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:847)

at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:845)

at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179)

at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

at org.apache.spark.scheduler.Task.run(Task.scala:54)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

14/12/10 11:44:02 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 28,
localhost): com.fasterxml.jackson.core.JsonParseException: Unexpected
end-of-input within/between OBJECT entries

 at [Source: java.io.StringReader@4c8dd4d9; line: 1, column: 19]

com.fasterxml.jackson.core.JsonParser._constructError(
JsonParser.java:1524)

com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWS(
ReaderBasedJsonParser.java:1682)

com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(
ReaderBasedJsonParser.java:619)


com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(
MapDeserializer.java:412)


com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(
MapDeserializer.java:312)


com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(
MapDeserializer.java:26)

com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(
ObjectMapper.java:2986)

com.fasterxml.jackson.databind.ObjectMapper.readValue(
ObjectMapper.java:2091)


org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(
JsonRDD.scala:275)


org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(
JsonRDD.scala:274)

scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

scala.collection.Iterator$class.foreach(Iterator.scala:727)

scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.TraversableOnce$class.reduceLeft(
TraversableOnce.scala:172)

scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)

org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:847)

org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:845)

org.apache.spark.SparkContext$$anonfun$28.apply(
SparkContext.scala:1179)

org.apache.spark.SparkContext$$anonfun$28.apply(
SparkContext.scala:1179)

org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178
)

java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:615)

java.lang.Thread.run(Thread.java:744)

14/12/10 11:44:02 ERROR TaskSetManager: Task 0 in stage 14.0 failed 1
times; aborting job

14/12/

Spark 1.0.0 Standalone mode config

2014-12-10 Thread 9000revs
I am using CDH5.1 and Spark 1.0.0.

Trying to configure the resources allocated to each application. How do I
do this? For example, I would like each app to use 2 cores and 8G of RAM. I have
tried using the pyspark command-line parameters --driver-memory and
--driver-cores and see no effect of those changes in the Spark Master web UI
when the app is started.

Is there anyway to do this from inside Cloudera Manager also?

Thanks.
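
For reference: on a standalone cluster the per-application knobs are the executor 
memory and the spark.cores.max cap; the --driver-* options only size the driver 
process, which is why they show no effect on the workers. A conf/spark-defaults.conf 
sketch with placeholder values:

  spark.executor.memory  8g
  spark.cores.max        2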




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-Standalone-mode-config-tp20609.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cluster getting a null pointer error

2014-12-10 Thread Yana Kadiyska
does spark-submit with SparkPi and spark-examples.jar work?

e.g.

./spark/bin/spark-submit  --class org.apache.spark.examples.SparkPi
--master spark://xx.xx.xx.xx:7077  /path/to/examples.jar


On Tue, Dec 9, 2014 at 6:58 PM, Eric Tanner 
wrote:

> I have set up a cluster on AWS and am trying a really simple hello world
> program as a test.  The cluster was built using the ec2 scripts that come
> with Spark.  Anyway, I have output the error message (using --verbose)
> below.  The source code is further below that.
>
> Any help would be greatly appreciated.
>
> Thanks,
>
> Eric
>
> *Error code:*
>
> r...@ip-xx.xx.xx.xx ~]$ ./spark/bin/spark-submit  --verbose  --class
> com.je.test.Hello --master spark://xx.xx.xx.xx:7077
>  Hello-assembly-1.0.jar
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> Using properties file: /root/spark/conf/spark-defaults.conf
> Adding default property: spark.executor.memory=5929m
> Adding default property:
> spark.executor.extraClassPath=/root/ephemeral-hdfs/conf
> Adding default property:
> spark.executor.extraLibraryPath=/root/ephemeral-hdfs/lib/native/
> Using properties file: /root/spark/conf/spark-defaults.conf
> Adding default property: spark.executor.memory=5929m
> Adding default property:
> spark.executor.extraClassPath=/root/ephemeral-hdfs/conf
> Adding default property:
> spark.executor.extraLibraryPath=/root/ephemeral-hdfs/lib/native/
> Parsed arguments:
>   master  spark://xx.xx.xx.xx:7077
>   deployMode  null
>   executorMemory  5929m
>   executorCores   null
>   totalExecutorCores  null
>   propertiesFile  /root/spark/conf/spark-defaults.conf
>   extraSparkPropertiesMap()
>   driverMemorynull
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   com.je.test.Hello
>   primaryResource file:/root/Hello-assembly-1.0.jar
>   namecom.je.test.Hello
>   childArgs   []
>   jarsnull
>   verbose true
>
> Default properties from /root/spark/conf/spark-defaults.conf:
>   spark.executor.extraLibraryPath -> /root/ephemeral-hdfs/lib/native/
>   spark.executor.memory -> 5929m
>   spark.executor.extraClassPath -> /root/ephemeral-hdfs/conf
>
>
> Using properties file: /root/spark/conf/spark-defaults.conf
> Adding default property: spark.executor.memory=5929m
> Adding default property:
> spark.executor.extraClassPath=/root/ephemeral-hdfs/conf
> Adding default property:
> spark.executor.extraLibraryPath=/root/ephemeral-hdfs/lib/native/
> Main class:
> com.je.test.Hello
> Arguments:
>
> System properties:
> spark.executor.extraLibraryPath -> /root/ephemeral-hdfs/lib/native/
> spark.executor.memory -> 5929m
> SPARK_SUBMIT -> true
> spark.app.name -> com.je.test.Hello
> spark.jars -> file:/root/Hello-assembly-1.0.jar
> spark.executor.extraClassPath -> /root/ephemeral-hdfs/conf
> spark.master -> spark://xxx.xx.xx.xxx:7077
> Classpath elements:
> file:/root/Hello-assembly-1.0.jar
>
> *Actual Error:*
> Exception in thread "main" java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> *Source Code:*
> package com.je.test
>
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> class Hello {
>
>   def main(args: Array[String]): Unit = {
>
> val conf = new SparkConf(true)//.set("spark.cassandra.connection.host", 
> "xxx.xx.xx.xxx")
> val sc = new SparkContext("spark://xxx.xx.xx.xxx:7077", "Season", conf)
>
> println("Hello World")
>
>   }
> }
>
>
>
>
>


Re: Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines ...

2014-12-10 Thread Michael Armbrust
Yep, this is the expected semantic: sc.textFile only guarantees that lines are
preserved across splits. It would be possible to write a custom input format
for multi-line JSON, but that hasn't been done yet.  From the documentation:

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets


>
>- jsonFile - loads data from a directory of JSON files *where each
>line of the files is a JSON object*.
>
>
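
In the meantime, one workaround is to read each file whole and hand the strings
to jsonRDD, which treats every RDD element as a complete JSON document. A
minimal sketch, assuming each input file holds exactly one JSON object (the
path is a placeholder):

import org.apache.spark.sql.SQLContext

// Read whole files so multi-line JSON objects stay intact, then let
// jsonRDD infer the schema from the complete documents.
val sqlContext = new SQLContext(sc)
val docs = sc.wholeTextFiles("hdfs:///path/to/json/dir").map { case (_, content) => content }
val schemaRDD = sqlContext.jsonRDD(docs)
schemaRDD.printSchema()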
On Wed, Dec 10, 2014 at 11:48 AM, Manoj Samel 
wrote:

> I am using SQLContext.jsonFile. If a valid JSON document contains newlines,
> Spark 1.1.1 dumps the trace below. If the JSON is read as one line, it works
> fine. Is this known?
>
>
> 14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID
> 28)
>
> com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input
> within/between OBJECT entries
>
>  at [Source: java.io.StringReader@4c8dd4d9; line: 1, column: 19]
>
> at com.fasterxml.jackson.core.JsonParser._constructError(
> JsonParser.java:1524)
>
> at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWS(
> ReaderBasedJsonParser.java:1682)
>
> at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(
> ReaderBasedJsonParser.java:619)
>
> at
> com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(
> MapDeserializer.java:412)
>
> at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(
> MapDeserializer.java:312)
>
> at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(
> MapDeserializer.java:26)
>
> at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(
> ObjectMapper.java:2986)
>
> at com.fasterxml.jackson.databind.ObjectMapper.readValue(
> ObjectMapper.java:2091)
>
> at
> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(
> JsonRDD.scala:275)
>
> at
> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(
> JsonRDD.scala:274)
>
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> at scala.collection.TraversableOnce$class.reduceLeft(
> TraversableOnce.scala:172)
>
> at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
>
> at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:847)
>
> at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:845)
>
> at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179
> )
>
> at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179
> )
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:744)
>
> 14/12/10 11:44:02 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID
> 28, localhost): com.fasterxml.jackson.core.JsonParseException: Unexpected
> end-of-input within/between OBJECT entries
>
>  at [Source: java.io.StringReader@4c8dd4d9; line: 1, column: 19]
>
> com.fasterxml.jackson.core.JsonParser._constructError(
> JsonParser.java:1524)
>
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWS(
> ReaderBasedJsonParser.java:1682)
>
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(
> ReaderBasedJsonParser.java:619)
>
>
> com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(
> MapDeserializer.java:412)
>
>
> com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(
> MapDeserializer.java:312)
>
>
> com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(
> MapDeserializer.java:26)
>
> com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(
> ObjectMapper.java:2986)
>
> com.fasterxml.jackson.databind.ObjectMapper.readValue(
> ObjectMapper.java:2091)
>
>
> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(
> JsonRDD.scala:275)
>
>
> org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(
> JsonRDD.scala:274)
>
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> scala.collection.TraversableOnce$class.reduceLeft(
> TraversableOnce.scala:172)
>
> scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
>
> org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:847)
>
> org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:845)
>
> 

MLlib: Libsvm: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-10 Thread Sameer Tilak
Hi All,
When I am running LinearRegressionWithSGD, I get the following error. Any help
on how to debug this further will be highly appreciated.

14/12/10 20:26:02 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 150323
  at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:231)
  at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:216)
  at breeze.linalg.operators.BinaryRegistry$class.apply(BinaryOp.scala:60)
  at breeze.linalg.VectorOps$$anon$178.apply(Vector.scala:391)
  at breeze.linalg.NumericOps$class.dot(NumericOps.scala:83)
  at breeze.linalg.DenseVector.dot(DenseVector.scala:47)
  at org.apache.spark.mllib.optimization.LeastSquaresGradient.compute(Gradient.scala:125)
  at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:180)
  at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:179)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
  at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
  at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
  at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
  at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
  at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

Best regards,
--Sameer.
  

Key not valid / already cancelled using Spark Streaming

2014-12-10 Thread Flávio Santos
Dear Spark'ers,

I'm trying to run a simple job using Spark Streaming (version 1.1.1) and
YARN (yarn-cluster), but unfortunately I'm facing many issues. In short, my
job does the following:
- Consumes a specific Kafka topic
- Writes its content to S3 or HDFS

Records in Kafka are in the form:
{"key": "someString"}

This is important because I use the value of "key" to define the output
file name in S3.
Here are the Spark and Kafka parameters I'm using:

val sparkConf = new SparkConf()
>   .setAppName("MyDumperApp")
>   .set("spark.task.maxFailures", "100")
>   .set("spark.hadoop.validateOutputSpecs", "false")
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
> val kafkaParams = Map(
>   "zookeeper.connect" -> zkQuorum,
>   "zookeeper.session.timeout.ms" -> "1",
>   "rebalance.backoff.ms" -> "8000",
>   "rebalance.max.retries" -> "10",
>   "group.id" -> group,
>   "auto.offset.reset" -> "largest"
> )


My application is the following:

KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc,
> kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
>   .foreachRDD((rdd, time) =>
> rdd.map {
>   case (_, line) =>
> val json = parse(line)
> val key = extract(json, "key").getOrElse("key_not_found")
> (key, dateFormatter.format(time.milliseconds)) -> line
> }
>   .partitionBy(new HashPartitioner(10))
>   .saveAsHadoopFile[KeyBasedOutput[(String,String),
> String]]("s3://BUCKET", classOf[BZip2Codec])
>   )


And the last piece:

class KeyBasedOutput[T >: Null, V <: AnyRef] extends
> MultipleTextOutputFormat[T , V] {
>   override protected def generateFileNameForKeyValue(key: T, value: V,
> leaf: String) = key match {
> case (myKey, batchId) =>
>   "somedir" + "/" + myKey + "/" +
> "prefix-" + myKey + "_" + batchId + "_" + leaf
>   }
>   override protected def generateActualKey(key: T, value: V) = null
> }


I use batch sizes of 5 minutes with checkpoints activated.
The job fails nondeterministically (I think it never ran longer than ~5
hours). I have no clue why, it simply fails.
Please find below the exceptions thrown by my application.

I really appreciate any kind of hint.
Thank you very much in advance.

Regards,
-- Flávio

 Executor 1

2014-12-10 19:05:15,150 INFO  [handle-read-write-executor-3]
network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
SendingConnection
 to ConnectionManagerId(ec2-DRIVER.compute-1.amazonaws.com,49163)
2014-12-10 19:05:15,201 INFO  [Thread-6] storage.MemoryStore
(Logging.scala:logInfo(59)) - ensureFreeSpace(98665) called with
curMem=194463488,
 maxMem=4445479895
2014-12-10 19:05:15,202 INFO  [Thread-6] storage.MemoryStore
(Logging.scala:logInfo(59)) - Block input-0-1418238315000 stored as bytes
in memor
y (estimated size 96.4 KB, free 4.0 GB)
2014-12-10 19:05:16,506 INFO  [handle-read-write-executor-2]
network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
ReceivingConnecti
on to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,51443)
2014-12-10 19:05:16,506 INFO  [handle-read-write-executor-1]
network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
SendingConnection
 to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,51443)
2014-12-10 19:05:16,506 INFO  [handle-read-write-executor-2]
network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
SendingConnection
 to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,51443)
2014-12-10 19:05:16,649 INFO  [handle-read-write-executor-0]
network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
ReceivingConnecti
on to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,39644)
2014-12-10 19:05:16,649 INFO  [handle-read-write-executor-3]
network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
SendingConnection
 to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,39644)
2014-12-10 19:05:16,649 INFO  [handle-read-write-executor-0]
network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
SendingConnection
 to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,39644)
2014-12-10 19:05:16,650 INFO  [connection-manager-thread]
network.ConnectionManager (Logging.scala:logInfo(59)) - Key not valid ?
sun.nio.ch.Se
lectionKeyImpl@da2e041
2014-12-10 19:05:16,651 INFO  [connection-manager-thread]
network.ConnectionManager (Logging.scala:logInfo(80)) - key already
cancelled ? sun.n
io.ch.SelectionKeyImpl@da2e041
*java.nio.channels.CancelledKeyException*
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:316)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:145)
2014-12-10 19:05:16,679 INFO  [handle-read-write-executor-1]
network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
ReceivingConnection to ConnectionManagerId(
ec2-EXECUTOR.compute-1.amazonaws.com,39444)
2014-12-10 19:05:16,67

Trouble with cache() and parquet

2014-12-10 Thread Yana Kadiyska
Hi folks, wondering if anyone has thoughts. Trying to create something akin
to a materialized view (sqlContext is a HiveContext connected to external
metastore):


val last2HourRdd = sqlContext.sql(s"select * from mytable")
//last2HourRdd.first prints out a  org.apache.spark.sql.Row = [...] with
valid data

 last2HourRdd.cache()
//last2HourRdd.first now fails in an executor with the following:

In the driver:

14/12/10 20:24:01 INFO TaskSetManager: Starting task 0.1 in stage 25.0
(TID 35, iphere, NODE_LOCAL, 2170 bytes)
14/12/10 20:24:01 INFO TaskSetManager: Lost task 0.1 in stage 25.0
(TID 35) on executor iphere: java.lang.ClassCastException (null)
[duplicate 1]

​


And in executor:

14/12/10 19:56:57 ERROR Executor: Exception in task 0.1 in stage 20.0 (TID 27)
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at 
org.apache.spark.sql.catalyst.expressions.MutableInt.update(SpecificMutableRow.scala:73)
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.update(SpecificMutableRow.scala:231)
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setString(SpecificMutableRow.scala:236)
at org.apache.spark.sql.columnar.STRING$.setField(ColumnType.scala:328)
at org.apache.spark.sql.columnar.STRING$.setField(ColumnType.scala:310)
at 
org.apache.spark.sql.columnar.compression.RunLengthEncoding$Decoder.next(compressionSchemes.scala:168)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
at 
org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
at 
org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
at 
org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
at 
org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
at 
org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
at 
org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$9$$anonfun$14$$anon$2.next(InMemoryColumnarTableScan.scala:279)
at 
org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$9$$anonfun$14$$anon$2.next(InMemoryColumnarTableScan.scala:275)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)


Any thoughts on this? Not sure if using the external metastore for the
inital pull is a problem or if I'm just hitting a bug.


Re: Trouble with cache() and parquet

2014-12-10 Thread Michael Armbrust
Have you checked to make sure the schema in the metastore matches the
schema in the parquet file?  One way to test would be to just use
sqlContext.parquetFile(...) which infers the schema from the file instead
of using the metastore.
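
A quick way to compare the two (a minimal sketch; the table name and path are
placeholders for the ones in your metastore):

sqlContext.sql("select * from mytable limit 1").printSchema()   // schema as the metastore reports it
sqlContext.parquetFile("/path/to/mytable").printSchema()        // schema read from the parquet files themselves
// A column that shows up as string in one and int in the other would explain the ClassCastException.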

On Wed, Dec 10, 2014 at 12:46 PM, Yana Kadiyska 
wrote:

>
> Hi folks, wondering if anyone has thoughts. Trying to create something
> akin to a materialized view (sqlContext is a HiveContext connected to
> external metastore):
>
>
> val last2HourRdd = sqlContext.sql(s"select * from mytable")
> //last2HourRdd.first prints out a  org.apache.spark.sql.Row = [...] with
> valid data
>
>  last2HourRdd.cache()
> //last2HourRdd.first now fails in an executor with the following:
>
> In the driver:
>
> 14/12/10 20:24:01 INFO TaskSetManager: Starting task 0.1 in stage 25.0 (TID 
> 35, iphere, NODE_LOCAL, 2170 bytes)
> 14/12/10 20:24:01 INFO TaskSetManager: Lost task 0.1 in stage 25.0 (TID 35) 
> on executor iphere: java.lang.ClassCastException (null) [duplicate 1]
>
> ​
>
>
> And in executor:
>
> 14/12/10 19:56:57 ERROR Executor: Exception in task 0.1 in stage 20.0 (TID 27)
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.MutableInt.update(SpecificMutableRow.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.update(SpecificMutableRow.scala:231)
>   at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setString(SpecificMutableRow.scala:236)
>   at org.apache.spark.sql.columnar.STRING$.setField(ColumnType.scala:328)
>   at org.apache.spark.sql.columnar.STRING$.setField(ColumnType.scala:310)
>   at 
> org.apache.spark.sql.columnar.compression.RunLengthEncoding$Decoder.next(compressionSchemes.scala:168)
>   at 
> org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
>   at 
> org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
>   at 
> org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
>   at 
> org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
>   at 
> org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
>   at 
> org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
>   at 
> org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$9$$anonfun$14$$anon$2.next(InMemoryColumnarTableScan.scala:279)
>   at 
> org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$9$$anonfun$14$$anon$2.next(InMemoryColumnarTableScan.scala:275)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
>   at 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>
> Any thoughts on this? Not sure if using the external metastore for the
> inital pull is a problem or if I'm just hitting a bug.
>
>
>


Re: MLLib in Production

2014-12-10 Thread Ganelin, Ilya
Hi all – I’ve been storing the model’s internally generated userFeatures and
productFeatures vectors serialized on disk, and importing them in a separate job.
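
Roughly like this (a sketch, assuming `model` is a trained MatrixFactorizationModel
from ALS.train and the HDFS paths are placeholders):

import org.apache.spark.SparkContext._

// Persist the ALS factors so a separate job can score without retraining.
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// In the other job: reload the factors and score a (user, product) pair with a
// plain dot product, which is what predict() computes from these vectors.
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")

def score(user: Int, product: Int): Option[Double] =
  for {
    u <- userFeatures.lookup(user).headOption
    p <- productFeatures.lookup(product).headOption
  } yield u.zip(p).map { case (a, b) => a * b }.sum

For low-latency serving you would typically collect the factors into memory or a
key-value store rather than calling lookup() on the RDDs per request.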

From: Sonal Goyal mailto:sonalgoy...@gmail.com>>
Date: Wednesday, December 10, 2014 at 5:31 AM
To: Yanbo Liang mailto:yanboha...@gmail.com>>
Cc: Simon Chan mailto:simonc...@gmail.com>>, Klausen 
Schaefersinho mailto:klaus.schaef...@gmail.com>>, 
"user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: Re: MLLib in Production

You can also serialize the model and use it in other places.

Best Regards,
Sonal
Founder, Nube Technologies





On Wed, Dec 10, 2014 at 5:32 PM, Yanbo Liang 
mailto:yanboha...@gmail.com>> wrote:
Hi Klaus,

There is no ideal method yet, but there is a workaround.
Train the model in a Spark or YARN cluster, then use RDD.saveAsTextFile to
store the model (its weights and intercept) to HDFS.
Load the weights and intercept files from HDFS, construct a GLM model from them,
and then call model.predict() to get what you want (a sketch follows below).

The Spark community also has some ongoing work on exporting models with PMML.
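
A sketch of that save/reload flow (paths are placeholders; it assumes `model`
came from e.g. LogisticRegressionWithSGD.train and that the (weights, intercept)
constructor of the model class is accessible in your Spark version):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// After training: write weights and intercept as a single line of text.
sc.parallelize(Seq(model.weights.toArray.mkString(",") + ";" + model.intercept), 1)
  .saveAsTextFile("hdfs:///models/lr")

// Later (possibly in another application): parse the line and rebuild the model.
val Array(w, b) = sc.textFile("hdfs:///models/lr").first().split(";")
val restored = new LogisticRegressionModel(Vectors.dense(w.split(",").map(_.toDouble)), b.toDouble)
// restored.predict(features) now scores new points without retraining.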

2014-12-10 18:32 GMT+08:00 Simon Chan 
mailto:simonc...@gmail.com>>:
Hi Klaus,

PredictionIO is an open source product based on Spark MLlib for exactly this 
purpose.
This is the tutorial for classification in particular: 
http://docs.prediction.io/classification/quickstart/

You can add custom serving logics and retrieve prediction result through REST 
API/SDKs at other places.

Simon


On Wed, Dec 10, 2014 at 2:25 AM, Klausen Schaefersinho 
mailto:klaus.schaef...@gmail.com>> wrote:
Hi,


I would like to use Spark to train a model, but use the model in some other
place, e.g. a servlet, to do some classification in real time.

What is the best way to do this? Can I just copy a model file or something and
load it in the servlet? Can anybody point me to a good tutorial?


Cheers,


Klaus



--
“Overfitting” is not about an excessive amount of physical exercise...





The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-10 Thread happyyxw
How many worker nodes do these 100 executors run on?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-java-lang-OutOfMemoryError-Java-heap-space-tp20584p20610.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Filtering nested data using Spark SQL

2014-12-10 Thread Jerry Lam
Hi spark users,

I'm trying to filter a json file that has the following schema using Spark
SQL:

root
 |-- user_id: string (nullable = true)
 |-- item: array (nullable = true)
 ||-- element: struct (containsNull = false)
 |||-- item_id: string (nullable = true)
 |||-- name: string (nullable = true)

I would like to filter distinct user_id based on the items it contains. For
instance, I would like to find out distinct user_id which has item's name
equal to "apple" for the following 'user' table

user_id | item
1 | ([1, apple], [1, apple], [2, orange])
2 | ([2, orange])

The result should be 1

I tried using hql:
select user_id from user lateral view explode(item) itemTable as itemColumn
where itemColumn.name = 'apple' group by user_id

but it seems inefficient: I could stop looking through the item array as soon
as I find the first item with name 'apple'. Also the "lateral view explode"
and "group by" are unnecessary.

I'm thinking of processing the 'user' table as a SchemaRDD. Ideally, I would
love to do (assuming user is a SchemaRDD):

val ids = user.select('user_id).where(contain('item, "name",
"apple")).collect()

where the contain function loops through the items and stops early at the
first one whose "name" is "apple".

Is this possible? If yes, how does one implement the contain function?
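
One way that might work (a sketch, assuming the array column materializes as a
Seq[Row] at the RDD level; exists() stops at the first match):

import org.apache.spark.sql.Row

val ids = user
  .filter { row =>
    val items = row(1).asInstanceOf[Seq[Row]]            // the 'item' array column
    items.exists(item => item.getString(1) == "apple")   // item fields: (item_id, name)
  }
  .map(_.getString(0))                                   // user_id
  .distinct()
  .collect()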

Best Regards,

Jerry


Unread block issue w/ spark 1.1.0 on CDH5

2014-12-10 Thread Anson Abraham
I recently installed standalone Spark through Cloudera Manager on my CDH
5.2 cluster.  CDH 5.2 is running on CentOS release 6.6.  The version of
Spark, again through Cloudera, is 1.1.  It is standalone.

I have a file in hdfs in /tmp/testfile.txt.

So what I do is i run spark-shell:

scala> val source = sc.textFile("/tmp/testfile.tsv")


14/12/10 22:41:14 INFO MemoryStore: ensureFreeSpace(163794) called with
curMem=0, maxMem=278302556
14/12/10 22:41:14 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 160.0 KB, free 265.3 MB)
14/12/10 22:41:14 INFO MemoryStore: ensureFreeSpace(13592) called with
curMem=163794, maxMem=278302556
14/12/10 22:41:14 INFO MemoryStore: Block broadcast_0_piece0 stored as
bytes in memory (estimated size 13.3 KB, free 265.2 MB)
14/12/10 22:41:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
on cloudera-3.testdomain.net:34641 (size: 13.3 KB, free: 265.4 MB)
14/12/10 22:41:14 INFO BlockManagerMaster: Updated info of block
broadcast_0_piece0
source: org.apache.spark.rdd.RDD[String] = /tmp/testfile.tsv MappedRDD[1]
at textFile at :12

when i type this command:
scala> source.saveAsTextFile("/tmp/zzz_testsparkoutput")

I'm hitting this:

14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
:15
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
unread block data

java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)

java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)

java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Can't figure out what the issue is.  The file I'm loading is literally just
7 MB.  I thought it was a jar file mismatch, but I did a compare and they're
all identical.  And seeing as they were all installed through CDH parcels,
I'm not sure how there could be a version mismatch between the nodes and the
master.  Oh yeah, it's 1 master node with 2 worker nodes, running in
standalone mode, not through YARN.  Just in case, I copied the jars from the
master to the 2 worker nodes anyway, and still got the same issue.


My spark-env.sh on the servers:

#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##

export SPARK_HOME=/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark
export STANDALONE_SPARK_MASTER_HOST=cloudera-1.testdomain.net
export SPARK_MASTER_PORT=7077
expor

Using Intellij for pyspark

2014-12-10 Thread Stephen Boesch
Anyone have luck with this?  One issue encountered is handling multiple
languages - Python, Java, Scala - within one module: it is unclear how to
select two module SDKs.

Both Python and Scala facets were added to the spark-parent module. But
when the project-level SDK is not set to Python, the general Python
libraries are not present.  Setting the project-level SDK to Python
resolves that issue - but then the Java SDK classes are not found.


RDDs being cleaned too fast

2014-12-10 Thread ankits
I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast.
How can I inspect the size of an RDD in memory and get more information about
why it was cleaned up? There should be more than enough memory available on
the cluster to store them, and by default spark.cleaner.ttl is
infinite, so I want more information about why this is happening and how to
prevent it.

Spark just logs this when removing RDDs:

[2014-12-11 01:19:34,006] INFO  spark.storage.BlockManager [] [] - Removing
RDD 33
[2014-12-11 01:19:34,010] INFO  pache.spark.ContextCleaner []
[akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33
[2014-12-11 01:19:34,012] INFO  spark.storage.BlockManager [] [] - Removing
RDD 33
[2014-12-11 01:19:34,016] INFO  pache.spark.ContextCleaner []
[akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-being-cleaned-too-fast-tp20613.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Partitioner in sortBy

2014-12-10 Thread Kevin Jung
Hi,
I'm wondering if I can change the RangePartitioner used by sortBy to another
partitioner like HashPartitioner.
The first thing that comes to mind is that it cannot be replaced, because
RangePartitioner is part of the sort algorithm itself.
If we call mapPartitions expecting a key-based partitioning after sorting, we
need to repartition or coalesce the dataset because it is range-partitioned.
In that case we cannot avoid shuffling the dataset twice, once for sorting and
once for repartitioning.
That causes performance issues on large datasets.

Thanks,
Kevin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Partitioner-in-sortBy-tp20614.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Regarding Classification of Big Data

2014-12-10 Thread Chintan Bhatt
Hi,
How can I do classification of big data using Spark?
Which machine learning algorithm is preferable for that?

-- 
CHINTAN BHATT 
Assistant Professor,
U & P U Patel Department of Computer Engineering,
Chandubhai S. Patel Institute of Technology,
Charotar University of Science And Technology (CHARUSAT),
Changa-388421, Gujarat, INDIA.
http://www.charusat.ac.in
*Personal Website*: https://sites.google.com/a/ecchanga.ac.in/chintan/


Re: Spark 1.0.0 Standalone mode config

2014-12-10 Thread Marcelo Vanzin
Hello,

What do you mean by "app that uses 2 cores and 8G of RAM"?

Spark apps generally involve multiple processes. The command line
options you used affect only one of them (the driver). You may want to
take a look at similar configuration for executors. Also, check the
documentation: http://spark.apache.org/docs/latest/configuration.html
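
For example (a sketch with illustrative values, shown in Scala; the same two
properties can also go into conf/spark-defaults.conf and apply to pyspark),
in standalone mode spark.cores.max caps the total cores an app takes and
spark.executor.memory sets the memory per executor:

import org.apache.spark.{SparkConf, SparkContext}

// Limit this app to 2 cores in total and 8g per executor on a standalone master.
val conf = new SparkConf()
  .setMaster("spark://master:7077")        // placeholder master URL
  .setAppName("ResourceLimitedApp")
  .set("spark.executor.memory", "8g")
  .set("spark.cores.max", "2")
val sc = new SparkContext(conf)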


On Wed, Dec 10, 2014 at 11:59 AM, 9000revs <9000r...@gmail.com> wrote:
> I am using CDH5.1 and Spark 1.0.0.
>
> Trying to configure resources to be allocated to each application. How do I
> do this? For example, I would each app to use 2 cores and 8G of RAM. I have
> tried using the pyspark commandline paramaters for --driver-memory,
> --driver-cores and see no effect of those changes in the Spark Master web UI
> when the app is started.
>
> Is there anyway to do this from inside Cloudera Manager also?
>
> Thanks.
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-Standalone-mode-config-tp20609.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: flatMap and spilling of output to disk

2014-12-10 Thread Shixiong Zhu
Good catch. `Join` should use `Iterator`, too. I opened a JIRA here:
https://issues.apache.org/jira/browse/SPARK-4824

Best Regards,
Shixiong Zhu
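
For reference, the iterator-based rewrite discussed in the quoted messages below
looks roughly like this (a sketch, assuming rdd has type RDD[(String, Iterable[String])]):

// Building the per-key cartesian product lazily means flatMap never has to
// materialize all n^2 output pairs for one key at once.
val pairs = rdd.flatMap { case (key, values) =>
  for (v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)
}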

2014-12-10 21:35 GMT+08:00 Johannes Simon :

> Hi!
>
> Using an iterator solved the problem! I've been chewing on this for days,
> so thanks a lot to both of you!! :)
>
> Since in an earlier version of my code, I used a self-join to perform the
> same thing, and ran into the same problems, I just looked at the
> implementation of PairRDDFunction.join (Spark v1.1.1):
>
> def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K,
> (V, W))] = {
>   this.cogroup(other, partitioner).flatMapValues( pair =>
> for (v <- pair._1; w <- pair._2) yield (v, w)
>   )
> }
>
> Is there a reason to not use an iterator here if possible? Pardon my lack
> of Scala knowledge.. This should in any case cause the same problems I had
> when the size of vs/ws gets too large. (Though that question is more of a
> dev ml question)
>
> Thanks!
> Johannes
>
> Am 10.12.2014 um 13:44 schrieb Shixiong Zhu :
>
> for(v1 <- values; v2 <- values) yield ((v1, v2), 1) will generate all
> data at once and return all of them to flatMap.
>
> To solve your problem, you should use for (v1 <- values.iterator; v2 <-
> values.iterator) yield ((v1, v2), 1) which will generate the data when
> it’s necessary.
> ​
>
> Best Regards,
> Shixiong Zhu
>
> 2014-12-10 20:13 GMT+08:00 Johannes Simon :
>
>> Hi!
>>
>> I have been using spark a lot recently and it's been running really well
>> and fast, but now when I increase the data size, it's starting to run into
>> problems:
>> I have an RDD in the form of (String, Iterable[String]) - the
>> Iterable[String] was produced by a groupByKey() - and I perform a flatMap
>> on it that outputs some form of cartesian product of the values per key:
>>
>>
>> rdd.flatMap({case (key, values) => for(v1 <- values; v2 <- values) yield
>> ((v1, v2), 1)})
>>
>>
>> So the runtime cost per RDD entry is O(n^2) where n is the number of
>> values. This n can sometimes be 10,000 or even 100,000. That produces a lot
>> of data, I am aware of that, but otherwise I wouldn't need a cluster, would
>> I? :) For n<=1000 this operation works quite well. But as soon as I allow n
>> to be <= 10,000 or higher, I start to get "GC overhead limit exceeded"
>> exceptions.
>>
>> Configuration:
>> - 7g executor memory
>> - spark.shuffle.memoryFraction=0.5
>> - spark.storage.memoryFraction=0.1
>> I am not sure how the remaining memory for the actual JVM instance
>> performing the flatMap is computed, but I would assume it to be something
>> like (1-0.5-0.1)*7g = 2.8g
>>
>> Now I am wondering: Why should these 2.8g (or say even a few hundred MB)
>> not suffice for spark to process this flatMap without too much GC overhead?
>> If I assume a string to be 10 characters on average, therefore consuming
>> about 60 bytes with overhead taken into account, then 10,000 of these
>> values sum up to no more than ~600kb, and apart from that spark never has
>> to keep anything else in memory.
>>
>> My question: When does spark start to spill RDD entries to disk, assuming
>> that no RDD is to be persisted? Does it keep all output of the flatMap
>> operation in memory until the entire flatMap is done? Or does it already
>> spill every single yielded "((v1, v2), 1)" entry out to disk if necessary?
>>
>> Thanks a lot!
>> Johannes
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Error outputing to CSV file

2014-12-10 Thread manasdebashiskar
saveAsSequenceFile is a method on an RDD; your object csv is a String.
If you are using spark-shell you can type the object's name to see its datatype.
Some prefer Eclipse (and its IntelliSense) to make their lives easier.
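
A minimal sketch of the distinction (variable names are hypothetical):

val lines: org.apache.spark.rdd.RDD[String] = sc.textFile("input.txt")   // an RDD: has saveAs* methods
val csv: String = lines.collect().mkString("\n")                         // a plain String: has none
lines.saveAsTextFile("out")        // fine
// csv.saveAsSequenceFile(...)     // does not compile; saveAsSequenceFile also needs an RDD of key/value pairs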

..Manas



-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-outputing-to-CSV-file-tp20583p20615.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: equivalent to sql in

2014-12-10 Thread manasdebashiskar
If you want to take out "apple" and "orange" you might want to try
dataRDD.filter(_._2 != "apple").filter(_._2 != "orange") and so on.
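
Equivalently, a single filter against a Set covers the whole NOT IN list
(a sketch; dataRDD is assumed to be a pair RDD as above):

// The RDD equivalent of SQL "value NOT IN ('apple', 'orange')".
val excluded = Set("apple", "orange")
val filtered = dataRDD.filter { case (_, value) => !excluded.contains(value) }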

...Manas



-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/equivalent-to-sql-in-tp20599p20616.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Saving Data only if Dstream is not empty

2014-12-10 Thread manasdebashiskar
Can you use countApprox as a condition to check for a non-empty RDD?
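
A sketch of that idea, using take(1) instead of a full count since only
emptiness matters (the output path is a placeholder):

dstream.foreachRDD { (rdd, time) =>
  if (rdd.take(1).nonEmpty) {                      // cheap emptiness check
    rdd.saveAsTextFile("hdfs:///out/batch-" + time.milliseconds)
  }
}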

..Manas



-
Manas Kar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Saving-Data-only-if-Dstream-is-not-empty-tp20587p20617.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Decision Tree with libsvmtools datasets

2014-12-10 Thread Ge, Yao (Y.)
I am testing a decision tree using the iris.scale data set
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#iris).
In the data set there are three class labels: 1, 2, and 3. However, in the
following code I have to set numClasses = 4; I get an
ArrayIndexOutOfBoundsException if I set numClasses = 3. Why?

var conf = new SparkConf().setAppName("DecisionTree")
var sc = new SparkContext(conf)

val data = MLUtils.loadLibSVMFile(sc,"data/iris.scale.txt");
val numClasses = 4;
val categoricalFeaturesInfo = Map[Int,Int]();
val impurity = "gini";
val maxDepth = 5;
val maxBins = 100;

val model = DecisionTree.trainClassifier(data, numClasses, 
categoricalFeaturesInfo, impurity, maxDepth, maxBins);

val labelAndPreds = data.map{ point =>
  val prediction = model.predict(point.features);
  (point.label, prediction)
}

val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / 
data.count;
println("Training Error = " + trainErr);
println("Learned classification tree model:\n" + model);

-Yao
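
A likely cause (an assumption, not confirmed in this thread): trainClassifier
expects labels in the range 0 to numClasses-1, so the 1-based iris labels need
numClasses = 4 unless they are shifted down first, e.g.:

import org.apache.spark.mllib.regression.LabeledPoint

// Shift the 1-based libsvm labels to the 0-based range MLlib expects,
// so numClasses can be 3.
val shifted = data.map(p => LabeledPoint(p.label - 1, p.features))
val model3 = DecisionTree.trainClassifier(shifted, 3, Map[Int, Int](), "gini", 5, 100)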


Decision Tree with Categorical Features

2014-12-10 Thread Ge, Yao (Y.)
Can anyone provide an example code of using Categorical Features in Decision 
Tree?
Thanks!

-Yao
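
A minimal sketch (the feature indices and arities are made up; categorical
values must be encoded as 0, 1, ..., k-1, and maxBins must be at least the
largest arity; trainingData is a placeholder RDD[LabeledPoint]):

import org.apache.spark.mllib.tree.DecisionTree

// Feature 0 is binary, feature 4 has 10 categories; all other features are continuous.
val categoricalFeaturesInfo = Map[Int, Int](0 -> 2, 4 -> 10)
val model = DecisionTree.trainClassifier(
  trainingData,              // RDD[LabeledPoint]
  2,                         // numClasses
  categoricalFeaturesInfo,
  "gini",                    // impurity
  5,                         // maxDepth
  32)                        // maxBins - must be >= the largest category count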


Compare performance of sqlContext.jsonFile and sqlContext.jsonRDD

2014-12-10 Thread Rakesh Nair
Couple of questions:
1. "sqlContext.jsonFile" reads a JSON file, infers the schema for the data
stored, and then returns a SchemaRDD. Now, I could also create a SchemaRDD
by reading a file as text (which returns RDD[String]) and then use the
"jsonRDD" method. My question: is the "jsonFile" way of creating a SchemaRDD
slower than the second method I mentioned (maybe because jsonFile needs to
infer the schema and jsonRDD just applies the schema to a dataset)?

 The workflow I am thinking of is: 1. For the first data set, use "jsonFile"
and infer the schema. 2. Save the schema somewhere. 3. For later data sets,
create an RDD[String] and then use the "jsonRDD" method to convert the
RDD[String] to a SchemaRDD (a sketch follows below).


2. What is the best way to store a schema, or rather how can I serialize a
StructType and store it in HDFS, so that I can load it later?
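
A sketch of that workflow (paths are placeholders; it assumes the jsonRDD
overload that takes an explicit StructType, and that a StructType round-trips
through plain Java serialization):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// 1. Infer the schema once from the first data set.
val first = sqlContext.jsonFile("hdfs:///data/day1.json")
val schema = first.schema

// 2. Persist the schema (StructType is a case class, so an object file works).
sc.parallelize(Seq(schema), 1).saveAsObjectFile("hdfs:///schemas/mydata")

// 3. For later data sets: reload the schema and skip inference.
val savedSchema = sc.objectFile[org.apache.spark.sql.StructType]("hdfs:///schemas/mydata").first()
val later = sqlContext.jsonRDD(sc.textFile("hdfs:///data/day2.json"), savedSchema)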

-- 
Regards
Rakesh Nair


parquet file not loading (spark v 1.1.0)

2014-12-10 Thread Rahul Bindlish
Hi,

I have created a parquet file from a case class using "saveAsParquetFile",
then I try to reload it using "parquetFile", but it fails.

Sample code is attached.

Any help would be appreciated.

Regards,
Rahul

rahul@...
sample_parquet.sample_parquet

  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/parquet-file-not-loading-spark-v-1-1-0-tp20618.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDDs being cleaned too fast

2014-12-10 Thread Aaron Davidson
The ContextCleaner uncaches RDDs that have gone out of scope on the driver.
So it's possible that the given RDD is no longer reachable in your
program's control flow, or else it'd be a bug in the ContextCleaner.
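
A minimal sketch of what "reachable" means here (the path is a placeholder):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// The cleaner tracks RDDs through weak references, so a cached RDD only becomes
// eligible for cleanup once no strong reference to it remains on the driver.
val keepAlive = mutable.Map[String, RDD[_]]()

val cached = sc.textFile("hdfs:///data/input").cache()
keepAlive("input") = cached   // holding this reference prevents the ContextCleaner from removing its blocks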

On Wed, Dec 10, 2014 at 5:34 PM, ankits  wrote:

> I'm using spark 1.1.0 and am seeing persisted RDDs being cleaned up too
> fast.
> How can i inspect the size of RDD in memory and get more information about
> why it was cleaned up. There should be more than enough memory available on
> the cluster to store them, and by default, the spark.cleaner.ttl is
> infinite, so I want more information about why this is happening and how to
> prevent it.
>
> Spark just logs this when removing RDDs:
>
> [2014-12-11 01:19:34,006] INFO  spark.storage.BlockManager [] [] - Removing
> RDD 33
> [2014-12-11 01:19:34,010] INFO  pache.spark.ContextCleaner []
> [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33
> [2014-12-11 01:19:34,012] INFO  spark.storage.BlockManager [] [] - Removing
> RDD 33
> [2014-12-11 01:19:34,016] INFO  pache.spark.ContextCleaner []
> [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-being-cleaned-too-fast-tp20613.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Issue when upgrading from Spark 1.1.0 to 1.1.1: Exception of java.lang.NoClassDefFoundError: io/netty/util/TimerTask

2014-12-10 Thread Akhil Das
You could try adding the netty.io jars in the classpath. Looks like that jar
is missing.
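
For an sbt build that would be something like the line below (the exact version
is an assumption and should match the Netty your Spark distribution ships):

// build.sbt: io.netty.util.TimerTask lives in the Netty 4.x "netty-all" artifact.
libraryDependencies += "io.netty" % "netty-all" % "4.0.23.Final"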

Thanks
Best Regards

On Thu, Dec 11, 2014 at 12:15 AM, S. Zhou  wrote:

> Everything worked fine on Spark 1.1.0 until we upgrade to 1.1.1. For some
> of our unit tests we saw the following exceptions. Any idea how to solve
> it? Thanks!
>
> java.lang.NoClassDefFoundError: io/netty/util/TimerTask
> at
> org.apache.spark.storage.BlockManager.(BlockManager.scala:72)
> at
> org.apache.spark.storage.BlockManager.(BlockManager.scala:168)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
> at org.apache.spark.SparkContext.(SparkContext.scala:204)
> at
> spark.jobserver.util.DefaultSparkContextFactory.makeContext(SparkContextFactory.scala:34)
> at
> spark.jobserver.JobManagerActor.createContextFromConfig(JobManagerActor.scala:255)
> at
> spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:104)
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> ...
>
>
>


Re: Key not valid / already cancelled using Spark Streaming

2014-12-10 Thread Akhil Das
Try to add the following to the sparkConf

 .set("spark.core.connection.ack.wait.timeout","6000")

  .set("spark.akka.frameSize","60")

Used to face that issue with spark 1.1.0

Thanks
Best Regards

On Thu, Dec 11, 2014 at 2:14 AM, Flávio Santos 
wrote:

> Dear Spark'ers,
>
> I'm trying to run a simple job using Spark Streaming (version 1.1.1) and
> YARN (yarn-cluster), but unfortunately I'm facing many issues. In short, my
> job does the following:
> - Consumes a specific Kafka topic
> - Writes its content to S3 or HDFS
>
> Records in Kafka are in the form:
> {"key": "someString"}
>
> This is important because I use the value of "key" to define the output
> file name in S3.
> Here are the Spark and Kafka parameters I'm using:
>
> val sparkConf = new SparkConf()
>>   .setAppName("MyDumperApp")
>>   .set("spark.task.maxFailures", "100")
>>   .set("spark.hadoop.validateOutputSpecs", "false")
>>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>   .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
>> val kafkaParams = Map(
>>   "zookeeper.connect" -> zkQuorum,
>>   "zookeeper.session.timeout.ms" -> "1",
>>   "rebalance.backoff.ms" -> "8000",
>>   "rebalance.max.retries" -> "10",
>>   "group.id" -> group,
>>   "auto.offset.reset" -> "largest"
>> )
>
>
> My application is the following:
>
> KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc,
>> kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
>>   .foreachRDD((rdd, time) =>
>> rdd.map {
>>   case (_, line) =>
>> val json = parse(line)
>> val key = extract(json, "key").getOrElse("key_not_found")
>> (key, dateFormatter.format(time.milliseconds)) -> line
>> }
>>   .partitionBy(new HashPartitioner(10))
>>   .saveAsHadoopFile[KeyBasedOutput[(String,String),
>> String]]("s3://BUCKET", classOf[BZip2Codec])
>>   )
>
>
> And the last piece:
>
> class KeyBasedOutput[T >: Null, V <: AnyRef] extends
>> MultipleTextOutputFormat[T , V] {
>>   override protected def generateFileNameForKeyValue(key: T, value: V,
>> leaf: String) = key match {
>> case (myKey, batchId) =>
>>   "somedir" + "/" + myKey + "/" +
>> "prefix-" + myKey + "_" + batchId + "_" + leaf
>>   }
>>   override protected def generateActualKey(key: T, value: V) = null
>> }
>
>
> I use batch sizes of 5 minutes with checkpoints activated.
> The job fails nondeterministically (I think it never ran longer than ~5
> hours). I have no clue why, it simply fails.
> Please find below the exceptions thrown by my application.
>
> I really appreciate any kind of hint.
> Thank you very much in advance.
>
> Regards,
> -- Flávio
>
>  Executor 1
>
> 2014-12-10 19:05:15,150 INFO  [handle-read-write-executor-3]
> network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
> SendingConnection
>  to ConnectionManagerId(ec2-DRIVER.compute-1.amazonaws.com,49163)
> 2014-12-10 19:05:15,201 INFO  [Thread-6] storage.MemoryStore
> (Logging.scala:logInfo(59)) - ensureFreeSpace(98665) called with
> curMem=194463488,
>  maxMem=4445479895
> 2014-12-10 19:05:15,202 INFO  [Thread-6] storage.MemoryStore
> (Logging.scala:logInfo(59)) - Block input-0-1418238315000 stored as bytes
> in memor
> y (estimated size 96.4 KB, free 4.0 GB)
> 2014-12-10 19:05:16,506 INFO  [handle-read-write-executor-2]
> network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
> ReceivingConnecti
> on to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,51443)
> 2014-12-10 19:05:16,506 INFO  [handle-read-write-executor-1]
> network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
> SendingConnection
>  to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,51443)
> 2014-12-10 19:05:16,506 INFO  [handle-read-write-executor-2]
> network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
> SendingConnection
>  to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,51443)
> 2014-12-10 19:05:16,649 INFO  [handle-read-write-executor-0]
> network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
> ReceivingConnecti
> on to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,39644)
> 2014-12-10 19:05:16,649 INFO  [handle-read-write-executor-3]
> network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
> SendingConnection
>  to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,39644)
> 2014-12-10 19:05:16,649 INFO  [handle-read-write-executor-0]
> network.ConnectionManager (Logging.scala:logInfo(59)) - Removing
> SendingConnection
>  to ConnectionManagerId(ec2-EXECUTOR.compute-1.amazonaws.com,39644)
> 2014-12-10 19:05:16,650 INFO  [connection-manager-thread]
> network.ConnectionManager (Logging.scala:logInfo(59)) - Key not valid ?
> sun.nio.ch.Se
> lectionKeyImpl@da2e041
> 2014-12-10 19:05:16,651 INFO  [connection-manager-thread]
> network.ConnectionManager (Logging.scala:logInfo(80)) - key already
> cancelled ? sun.n
> io.ch.SelectionKeyImpl@da2e041
> *java.nio.c

Re: PySprak and UnsupportedOperationException

2014-12-10 Thread Mohamed Lrhazi
Thanks Davies. It turns out it was indeed a bug in elasticsearch-hadoop, and
they fixed it in last night's nightly build!

https://github.com/elasticsearch/elasticsearch-hadoop/issues/338

On Wed, Dec 10, 2014 at 2:52 AM, Davies Liu  wrote:

> On Tue, Dec 9, 2014 at 11:32 AM, Mohamed Lrhazi
>  wrote:
> > While trying simple examples of PySpark code, I systematically get these
> > failures when I try this.. I dont see any prior exceptions in the
> output...
> > How can I debug further to find root cause?
> >
> >
> > es_rdd = sc.newAPIHadoopRDD(
> > inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
> > keyClass="org.apache.hadoop.io.NullWritable",
> > valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
> > conf={
> > "es.resource" : "en_2014/doc",
> > "es.nodes":"rap-es2",
> > "es.query" :  """{"query":{"match_all":{}},"fields":["title"],
> > "size": 100}"""
> > }
> > )
> >
> >
> > titles=es_rdd.map(lambda d: d[1]['title'][0])
> > counts = titles.flatMap(lambda x: x.split(' ')).map(lambda x: (x,
> > 1)).reduceByKey(add)
> > output = counts.collect()
> >
> >
> >
> > ...
> > 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 93
> > 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_93
> > 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_93 of size 2448 dropped from memory (free 274984768)
> > 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 93
> > 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 92
> > 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_92
> > 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_92 of size 163391 dropped from memory (free 275148159)
> > 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 92
> > 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 91
> > 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_91
> > 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_91 of size 163391 dropped from memory (free 275311550)
> > 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 91
> > 14/12/09 19:27:30 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 72)
> > java.lang.UnsupportedOperationException
> >   at java.util.AbstractMap.put(AbstractMap.java:203)
> >   at java.util.AbstractMap.putAll(AbstractMap.java:273)
> >   at org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:373)
>
> It looks like it's a bug in ElasticSearch (EsInputFormat).
>
> >   at org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:322)
> >   at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.next(EsInputFormat.java:299)
> >   at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.nextKeyValue(EsInputFormat.java:227)
> >   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138)
> >   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> >   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >   at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:913)
> >   at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
> >   at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
> >   at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
> >   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
> >   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >   at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)
>
> This means that the task failed when it read the data in EsInputFormat to feed the Python mapper.
>
> >   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
> >   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
> >   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
> >   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
> >   at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
> > 14/12/09 19:27:30 INFO TaskSetManager: Starting task 2.0 in stage 67.0 (TID 74, localhost, ANY, 26266 bytes)
> > 14/12/09 19:27:30 INFO Executor: Running task 2.0 in stage 67.0 (TID 74)
> > 14/12/09 19:27:30 WARN TaskSetManager: Lost task 0.0 in stage 67.0 (TID 72, localhost): java.lang.UnsupportedOperationException:
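
For anyone puzzled by that trace: java.util.AbstractMap.putAll just loops over the entries and calls put, and AbstractMap's default put throws UnsupportedOperationException, so a read-only map handed to EsInputFormat's setCurrentValue fails exactly this way. Below is a tiny self-contained illustration (plain JDK collections only, not Elasticsearch code; the class and names are made up):

import java.util.{AbstractMap, Collections, Map => JMap}

object PutDemo {
  // A map built on AbstractMap that implements entrySet() but never overrides put(),
  // so it is effectively read-only.
  class ReadOnlyMap extends AbstractMap[String, String] {
    override def entrySet(): java.util.Set[JMap.Entry[String, String]] =
      Collections.emptySet[JMap.Entry[String, String]]()
  }

  def main(args: Array[String]): Unit = {
    val m: JMap[String, String] = new ReadOnlyMap
    // AbstractMap.putAll delegates to put(), whose default implementation throws,
    // reproducing the java.lang.UnsupportedOperationException seen in the trace above.
    m.putAll(Collections.singletonMap("k", "v"))
  }
}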


RE: Spark-SQL JDBC driver

2014-12-10 Thread Judy Nash
Looks like you are wondering why you cannot see, via Thrift, the RDD table you have created?

Based on my own experience with Spark 1.1, an RDD table created directly via Spark SQL (i.e. spark-shell or spark-sql.sh) is not visible to the Thrift server, since Thrift has its own session containing its own RDDs.
Spark SQL experts on the forum can confirm this, though.
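
For example, a quick JDBC check against the Thrift server shows only what that server's own session (and the Hive metastore behind it) knows about -- temp tables registered in a separate spark-shell will not appear. This is only a sketch; the host, port and blank credentials are the non-secure defaults from the docs, so adjust them for your setup:

import java.sql.DriverManager

object ThriftTableCheck {
  def main(args: Array[String]): Unit = {
    // Standard HiveServer2 JDBC driver; the Spark SQL Thrift server speaks the same protocol.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "", "")
    val rs = conn.createStatement().executeQuery("SHOW TABLES")
    while (rs.next()) println(rs.getString(1)) // lists metastore tables, not another session's temp tables
    conn.close()
  }
}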

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Tuesday, December 9, 2014 6:42 AM
To: Anas Mosaad
Cc: Judy Nash; user@spark.apache.org
Subject: Re: Spark-SQL JDBC driver

According to the stacktrace, you were still using SQLContext rather than 
HiveContext. To interact with Hive, HiveContext *must* be used.

Please refer to this page 
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
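
For reference, a rough sketch of the HiveContext route (this assumes a Spark 1.1 shell built with Hive support, where sc already exists; the Country case class and the CSV path are made-up placeholders for the real countries data):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.createSchemaRDD              // implicit RDD[case class] -> SchemaRDD

case class Country(id: Int, isoCode: String, name: String)

val raw = sc.textFile("countries.csv")          // hypothetical input path
val countries = raw.map(_.split(",")).map(r => Country(r(0).toInt, r(1), r(2)))

countries.registerTempTable("countries")        // temp table, visible only in this context
hiveContext.sql("SELECT * FROM countries").saveAsTable("countries_persisted")

Because saveAsTable goes through the Hive metastore here, a Thrift server configured against the same metastore should then see countries_persisted, unlike a temp table that only lives in the shell's own session.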

On 12/9/14 6:26 PM, Anas Mosaad wrote:
Back to the first question: will this mandate that Hive is up and running?

When I try it, I get the following exception. The documentation says that this method works only on a SchemaRDD. I thought that was why countries.saveAsTable did not work, so I created a tmp RDD containing the results from the registered temp table, and I could validate that it is a SchemaRDD, as shown below.


@Judy, I really appreciate your kind support and I want to understand; of course I don't want to waste your time. If you can point me to the documentation describing these details, that would be great.


scala> val tmp = sqlContext.sql("select * from countries")
tmp: org.apache.spark.sql.SchemaRDD =
SchemaRDD[12] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
PhysicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36

scala> tmp.saveAsTable("Countries")
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved plan found, tree:
'CreateTableAsSelect None, Countries, false, None
 Project [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29]
  Subquery countries
   LogicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36

        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
        at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
        at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
        at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
        at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
        at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
        at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
        at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
        at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
        at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
        at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
        at org.apache.spark.sql.SQLContext$QueryEx

Error on JavaSparkContext.stop()

2014-12-10 Thread Taeyun Kim
Hi,

 

When my Spark program calls JavaSparkContext.stop(), the following errors occur.

   

   14/12/11 16:24:19 INFO Main: sc.stop {
   14/12/11 16:24:20 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(cluster02,38918) not found
   14/12/11 16:24:20 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(cluster04,59659)
   java.nio.channels.ClosedChannelException
         at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
         at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
         at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:205)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
         at java.lang.Thread.run(Thread.java:745)
   14/12/11 16:24:20 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(cluster03,59821) not found
   14/12/11 16:24:20 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(cluster02,38918) not found
   14/12/11 16:24:20 WARN ConnectionManager: All connections not cleaned up
   14/12/11 16:24:20 INFO Main: sc.stop }

 

How can I fix this?

 

The configuration is as follows:

- Spark version is 1.1.1

- Client runs on Windows 7

- The cluster is Linux (CentOS 6.5).

- spark.master=yarn-client

- Since Spark has a problem submitting a job from Windows to Linux, I applied my patch to the Spark source code. (Please see https://github.com/apache/spark/pull/899 )

 

Spark 1.0.0 did not have this problem.

 

Thanks.