collect failed for unknown reason when deployed in standalone mode

2014-08-11 Thread wan...@testbird.com







Hi,
I use Spark 0.9 to run a simple computation, but it fails when I use
standalone mode.

code:

    val sc = new SparkContext(args(0), "BayesAnalysis",
      System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass).toSeq)

    val dataSet = sc.textFile(args(1)).map(_.split(",")).filter(_.length == 14).collect

    for(record <- dataSet){
      println(record.mkString(" "))
    }

Run in local mode, it completes successfully. Run in standalone mode, it fails:

java -classpath realclasspath mainclass sparkMaster hdfsFile
14/08/11 15:10:07 INFO scheduler.DAGScheduler: Failed to run collect at 
BayesAnalysis.scala:416

Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
0.0:0 failed 4 times (most recent failure: unknown)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)

at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)

at scala.Option.foreach(Option.scala:236)

at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

at akka.actor.ActorCell.invoke(ActorCell.scala:456)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)

at akka.dispatch.Mailbox.run(Mailbox.scala:219)

at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Company: TestBird
QQ: 2741334465
Email: wan...@testbird.com
Address: C8-3, Tianfu Software Park, High-tech Zone, Chengdu    Postal code: 610041
Website: http://www.testbird.com



Re: collect failed for unknown reason when deployed in standalone mode

2014-08-11 Thread jeanlyn92
Hi wangyi,
Do you have more detailed information?
I guess it may be caused by a jar that hasn't been uploaded to the
workers, such as the jar containing your main class.

./bin/spark-class org.apache.spark.deploy.Client launch \
   [client-options] \
   <cluster-url> <application-jar-url> <main-class> \
   [application-options]


application-jar-url: Path to a bundled jar including your application
and all dependencies. Currently, the URL must be globally visible
inside of your cluster, for instance, an `hdfs://` path or a `file://`
path that is present on all nodes.
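
For illustration, one way to avoid depending on SparkContext.jarOfClass is to
pass an explicit jar location that every worker can reach. A minimal sketch,
assuming the Spark 0.9 constructor that takes a jar list; the HDFS path and
jar name below are hypothetical:

  import org.apache.spark.SparkContext

  // The application jar must be reachable from every worker, e.g. an HDFS
  // path or a file:// path that exists on all nodes (made-up path below).
  val appJars = Seq("hdfs://namenode:8020/user/wangyi/jars/bayes-analysis.jar")
  val sc = new SparkContext(args(0), "BayesAnalysis",
    System.getenv("SPARK_HOME"), appJars)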






Spark RuntimeException due to Unsupported datatype NullType

2014-08-11 Thread rafeeq s
Hi,

Spark throws a RuntimeException (Unsupported datatype NullType) when saving a
jsonRDD containing null primitives with .saveAsParquetFile().

Code: I am trying to store the jsonRDD into a Parquet file using
saveAsParquetFile with the code below.

JavaRDD<String> javaRDD = ssc.sparkContext().parallelize(jsonData);
JavaSchemaRDD schemaObject = sqlContext.jsonRDD(javaRDD);
schemaObject.saveAsParquetFile("tweets/tweet" + time.toString().replace(" ms", "") + ".parquet");

Input: The JSON input below has some null values, which are not supported by
Spark and cause the error Unsupported datatype NullType.

{"id":"tag:search.twitter.com
,2005:11","objectType":"activity","actor":{"objectType":"person","id":"id:
twitter.com:111","link":"http://www.twitter.com/funtubevids","displayName":"مشاهد
حول العالم","postedTime":"2014-05-01T06:14:51.000Z","image":"
https://pbs.twimg.com/profile_images/111/VORNn-Df_normal.png";,
*"summary"*:*null*,"links":[{*"href":null*
,"rel":"me"}],"friendsCount":0,"followersCount":49,"listedCount":0,"statusesCount":61,
*"twitterTimeZone":null*,"verified":false*,"utcOffset":null*
,"preferredUsername":"funtubevids","languages":["en"],"favoritesCount":0},"verb":"post","postedTime":"2014-05-27T17:33:54.000Z","generator":{"displayName":"web","link":"
http://twitter.com
"},"provider":{"objectType":"service","displayName":"Twitter","link":"
http://www.twitter.com"},"link":";
http://twitter.com/funtubevids/statuses/1","body":"القيادة في
مدرج الطيران #مهبط #مدرج #مطار #هبوط #قيادة #سيارة #طائرة #airport #plane
#car https://t.co/gnn7LKE6pC","object":"urls":[{"url":";
https://t.co/gnn7LKE6pC","expanded_url":";
https://www.youtube.com/watch?v=J-j6RSRMvRo
","expanded_status":200}],"klout_score":10,"language":{"value":"ar"}}}


ERROR scheduler.JobScheduler: Error running job streaming job
140774119 ms.0
java.lang.RuntimeException: Unsupported datatype NullType
   at scala.sys.package$.error(package.scala:27)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:267)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$2.apply(ParquetTypes.scala:244)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$2.apply(ParquetTypes.scala:244)
   at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at
scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:243)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:235)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$2.apply(ParquetTypes.scala:244)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$2.apply(ParquetTypes.scala:244)
   at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at
scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:243)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypes.scala:287)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypes.scala:286)
   at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at
scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:285)
   at
org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:331)
   at
org.apache.spark.sql.parquet.
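
A possible workaround, sketched in Scala under the assumption of a Spark
1.1-era SQLContext that has the jsonRDD(rdd, schema) overload and the
programmatic StructType API: give jsonRDD an explicit schema, so a field that
is always null (like "summary" in the sample) is pinned to a concrete type
instead of being inferred as NullType. The field list and the input RDD
below are illustrative only:

  import org.apache.spark.sql._

  val sqlContext = new SQLContext(sc)

  // Declare only the fields of interest, with concrete types; "summary" is
  // always null in the sample, so pin it to StringType rather than letting
  // schema inference produce NullType.
  val schema = StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("objectType", StringType, nullable = true),
    StructField("summary", StringType, nullable = true)))

  val tweets = sqlContext.jsonRDD(jsonLines, schema)  // jsonLines: RDD[String]
  tweets.saveAsParquetFile("tweets/tweet.parquet")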

Re: error with pyspark

2014-08-11 Thread Ron Gonzalez
If you're running on Ubuntu, do ulimit -n, which gives the max number of
allowed open files. You will have to change the value in
/etc/security/limits.conf to something larger (e.g. 10000), then log out and log back in.
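
For reference, raising that limit typically means adding lines like the
following to /etc/security/limits.conf (the user name and value are only
examples; adjust to the account that actually runs Spark):

  # raise the open-file limit for the user running Spark
  sparkuser  soft  nofile  10000
  sparkuser  hard  nofile  10000

After logging back in, ulimit -n should report the new value.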

Thanks,
Ron

Sent from my iPad

> On Aug 10, 2014, at 10:19 PM, Davies Liu  wrote:
> 
>> On Fri, Aug 8, 2014 at 9:12 AM, Baoqiang Cao  wrote:
>> Hi There
>> 
>> I ran into a problem and can’t find a solution.
>> 
>> I was running bin/pyspark < ../python/wordcount.py
> 
> you could use bin/spark-submit  ../python/wordcount.py
> 
>> The wordcount.py is here:
>> 
>> 
>> import sys
>> from operator import add
>> 
>> from pyspark import SparkContext
>> 
>> datafile = '/mnt/data/m1.txt'
>> 
>> sc = SparkContext()
>> outfile = datafile + '.freq'
>> lines = sc.textFile(datafile, 1)
>> counts = lines.flatMap(lambda x: x.split(' ')) \
>>.map(lambda x: (x, 1)) \
>>.reduceByKey(add)
>> output = counts.collect()
>> 
>> outf = open(outfile, 'w')
>> 
>> for (word, count) in output:
>>   outf.write(word.encode('utf-8') + '\t' + str(count) + '\n')
>> outf.close()
>> 
>> 
>> 
>> The error message is here:
>> 
>> 14/08/08 16:01:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 0)
>> java.io.FileNotFoundException:
>> /tmp/spark-local-20140808160150-d36b/12/shuffle_0_0_468 (Too many open
>> files)
> 
> This message means that Spark (the JVM) had reached the max number of open
> files; there is an fd leak somewhere. Unfortunately I cannot reproduce this
> problem. What is the version of Spark?
> 
>>at java.io.FileOutputStream.open(Native Method)
>>at java.io.FileOutputStream.(FileOutputStream.java:221)
>>at
>> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
>>at
>> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
>>at
>> org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
>>at
>> org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
>>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>at
>> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
>>at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>>at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>at java.lang.Thread.run(Thread.java:744)
>> 
>> 
>> The m1.txt is about 4G, and I have >120GB Ram and used -Xmx120GB. It is on
>> Ubuntu. Any help please?
>> 
>> Best
>> Baoqiang Cao
>> Blog: http://baoqiang.org
>> Email: bqcaom...@gmail.com
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark sql (can it call impala udf)

2014-08-11 Thread marspoc
I want to run the query below, which I currently run in Impala calling a C++
UDF, in Spark SQL. pnl_flat_pp and pfh_flat are both partitioned Impala
tables.

Can Spark SQL do that?



select a.pnl_type_code,percentile_udf_cloudera(cast(90.0 as
double),sum(pnl_vector1),sum(pnl_vector2),sum(pnl_vector3),sum(pnl_vector4),sum(pnl_vector5),sum(pnl_vector6),sum(pnl_vector7),sum(pnl_vector8),sum(pnl_vector9),sum(pnl_vector10),sum(pnl_vector11),sum(pnl_vector12),sum(pnl_vector13),sum(pnl_vector14))
FROM ibrisk.pnl_flat_pp a JOIN(select portfolio_code from ibrisk.pfh_flat
where pl0_code = '"3"') b ON a.portfolio_code = b.portfolio_code where
rf_level = '"0"' and calc_ref = "7020704" and excl_pnl != '"1"' group by
a.pnl_type_code



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-can-it-call-impala-udf-tp11878.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[spark-streaming] kafka source and flow control

2014-08-11 Thread gpasquiers
Hi,

I’m new to this mailing list as well as spark-streaming.

I’m using spark-streaming in a cloudera environment to consume a kafka
source and store all data into hdfs. There is a great volume of data and our
issue is that the kafka consumer is going too fast for HDFS, it fills up the
storage (memory) of spark-streaming, and makes it drop messages that have
not been saved to HDFS yet…

I’m not at all in an optimal environment regarding performance (spark
running in a VM, accessing HDFS through an SSH tunnel), but I think it’s not
a bad setup to test my system’s reliability.

I’ve been (unsuccessfully) looking for a way to slow down my kafka consumer
depending on the system’s health, maybe make the store() operation block if
there is no available room for storage.

I have complete control over the kafka consumer since I developed a custom
Receiver as a workaround to :
https://issues.apache.org/jira/browse/SPARK-2103, but I’d like to make the flow
control more intelligent than a simple rate limit (x messages or bytes per
second).

I’m interested in all ideas or suggestions.

B.R.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafka-source-and-flow-control-tp11879.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: CDH5, HiveContext, Parquet

2014-08-11 Thread chutium
hive-thriftserver also does not work with parquet tables in the hive
metastore; will this PR fix that too?

And there is no need to change any pom.xml?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/CDH5-HiveContext-Parquet-tp11853p11880.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Low Performance of Shark over Spark.

2014-08-11 Thread vinay . kashyap



Hi Yana,
I notice there is GC happening in every executor, around 400ms on average.
Do you think it has a major impact on the overall query time?
Regarding the memory for a single worker, I have tried distributing the
memory by increasing the number of workers per node and dividing the memory
for each worker accordingly. That is, in my trials I configured 96G and 48G
per worker, but didn't see any difference in the query time.
Do you have any comments regarding the GC time? As I mentioned in the
earlier mails, I have varied the memory fractions but did not see much
difference.
 
Thanks and regards
Vinay Kashyap
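
To quantify that 400ms figure, a simple option (assuming the SPARK_JAVA_OPTS
mechanism of the Spark/Shark 0.9 generation, as described in the tuning
guide) is to turn on GC logging on the executors and compare GC time with
total query time:

  # in spark-env.sh; standard JVM GC-logging flags
  export SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

Newer web UIs also report GC time per task on the stage detail page, which
may be enough to judge whether the pauses matter relative to the query time.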

 


From:"Yana Kadiyska" 

Sent:"vinay.kashyap" 

Date:Sat, August 9, 2014 6:56 am

Subject:Re: Low Performance of Shark over Spark.







Can you see where your time is spent? I have noticed quite a bit of
variance in query time in the case where GC occurs in the middle of a
computation. I'm guessing you're talking about time averaged over multiple runs
but thought I'd mention this as a possible thing to check. I might be
mistaken, but 96G in a single worker (if I'm reading your mail correctly)
seems on the high side (although I cannot now find any recommendation on
what it should be).

 







On Fri, Aug 8, 2014 at 4:45 AM, vinay.kashyap
<vinay.kash...@socialinfra.net> wrote:


Hi Mayur,

I cannot use spark sql in this case because many of the aggregations are not
supported yet. Hence I migrated back to use Shark, as all those aggregation
functions are supported.

http://apache-spark-user-list.1001560.n3.nabble.com/Support-for-Percentile-and-Variance-Aggregation-functions-in-Spark-with-HiveContext-td10658.html

Forgot to mention in the earlier thread that the raw_table which I am using
is actually a parquet table.

>> 2. cache data at a partition level from Hive & operate on those instead.

Do you mean that I need to cache the table created by querying data for a
set of few months and then issue the ad hoc query on that table?

Thanks and regards
Vinay Kashyap

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Low-Performance-of-Shark-over-Spark-tp11649p11776.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

 







Re: How to direct insert values into SparkSQL tables?

2014-08-11 Thread chutium
No, Spark SQL cannot insert into or update text files yet; it can only
insert into Parquet files.

But

people.union(new_people).registerAsTable("people")

could be an idea.
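
A minimal sketch of that union-and-re-register idea, assuming the Spark
1.0-style SQL API with the createSchemaRDD implicit; the Person case class
and the rows are made up for illustration:

  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD  // implicit RDD[Person] -> SchemaRDD

  val people = sc.parallelize(Seq(Person("Alice", 30)))
  val new_people = sc.parallelize(Seq(Person("Bob", 25)))

  // "insert" by unioning old and new rows and re-registering the table name
  people.union(new_people).registerAsTable("people")
  sqlContext.sql("SELECT name FROM people").collect().foreach(println)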



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-direct-insert-vaules-into-SparkSQL-tables-tp11851p11882.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers
Hi,

We intend to apply other operations on the data later in the same spark 
context, but our first step is to archive it.

Our goal is something like this:
Step 1 : consume kafka
Step 2 : archive to hdfs AND send to step 3
Step 3 : transform data
Step 4 : save transformed data to HDFS as input for M/R

Yes, maybe spark-streaming isn’t the right tool for our needs, but it looked
like it, so I wonder if I missed something.
To us it looks like a serious flaw if, in streaming mode, spark-streaming cannot
slow down its consumption depending on the available resources.

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: lundi 11 août 2014 11:44
To: Gwenhael Pasquiers
Subject: Re: [spark-streaming] kafka source and flow control

Hi,

On Mon, Aug 11, 2014 at 6:19 PM, gpasquiers
<gwenhael.pasqui...@ericsson.com> wrote:
I’m using spark-streaming in a cloudera environment to consume a kafka
source and store all data into hdfs.

I assume you are doing something else in between? If not, maybe a software such 
as Apache Flume may be better suited?

I have complete control over the kafka consumer since I developed a custom
Receiver as a workaround to :
https://issues.apache.org/jira/browse/SPARK-2103 but i’d like to make a flow
control more intelligent than a simple rate limite (x messages or bytes per
second).

I’m interested in all ideas or suggestions.

I think what I would try to do is measuring how much data I can process within 
one time window (i.e., keep track of processing speed) and then (continuously?) 
adapt the data rate to something that I am capable of processing. In that case 
you would have to make sure that the data doesn't instead get lost within 
Kafka. After all, the problem seems to be that your HDFS is too slow and you'll 
have to buffer that data *somewhere*, right?

Tobias
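
A very rough sketch of that idea, stripped to its core: a blocking limiter
that a custom receiver could consult before each store(), so the consumer
stalls instead of outrunning HDFS. Everything below is hypothetical plain
Scala (names and the per-second budget are invented); wiring it into a real
Kafka receiver and adjusting the budget from observed batch processing times
is left out:

  // Blocks callers so that at most permitsPerSecond permits are handed out
  // per second. Intended for a single receiver thread.
  class SimpleRateLimiter(initialPermitsPerSecond: Long) {
    @volatile var permitsPerSecond: Long = initialPermitsPerSecond
    private var windowStart = System.currentTimeMillis()
    private var usedInWindow = 0L

    def acquire(permits: Long = 1L): Unit = synchronized {
      var granted = false
      while (!granted) {
        val now = System.currentTimeMillis()
        if (now - windowStart >= 1000L) {  // start a new one-second window
          windowStart = now
          usedInWindow = 0L
        }
        if (usedInWindow + permits <= permitsPerSecond) {
          usedInWindow += permits
          granted = true
        } else {
          Thread.sleep(10L)  // wait for the window to roll over
        }
      }
    }
  }

  // In the receiver loop (pseudo-wiring):
  //   limiter.acquire()        // blocks once the per-second budget is spent
  //   store(nextKafkaMessage)
  // A feedback loop could periodically lower or raise limiter.permitsPerSecond
  // based on how long recent batches took to reach HDFS.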


how to split RDD by key and save to different path

2014-08-11 Thread 诺铁
Hi,

I have googled and found a similar question without a good answer:
http://stackoverflow.com/questions/24520225/writing-to-hadoop-distributed-file-system-multiple-times-with-spark

In short, I would like to separate the raw data and divide it by some key,
for example create date, and put it in a directory named by date, so that I
can easily access a portion of the data later.

For now I have to extract all keys and then filter by key and save to file
repeatedly. Is there any good way to do this? Or maybe I shouldn't do such a
thing?
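
For reference, the filter-per-key approach described above looks roughly like
this (a sketch only: paths and the date-extraction logic are invented, and it
still filters the keyed data once per key, which is exactly the cost in
question):

  // assume the first comma-separated field of each line is the create date
  val raw = sc.textFile("hdfs:///data/raw/input.txt")
  val keyed = raw.map(line => (line.split(",")(0), line)).cache()  // (date, line)

  val dates = keyed.keys.distinct().collect()
  for (date <- dates) {
    keyed.filter(_._1 == date)
         .values
         .saveAsTextFile("hdfs:///data/by-date/" + date)
  }

A single-pass alternative that often comes up is saveAsHadoopFile with a
custom MultipleTextOutputFormat that derives the output path from the key,
but that drops down to the Hadoop output-format API rather than plain RDD
operations.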


ERROR UserGroupInformation: Can't find user in Subject:

2014-08-11 Thread Dan Foisy
Hi

I've installed Spark on a Windows 7 machine.  I can get the SparkShell up
and running but when running through the simple example in Getting Started,
I get the following error (tried running as administrator as well) - any
ideas?

scala> val textFile = sc.textFile("README.md")
14/08/11 08:55:52 WARN SizeEstimator: Failed to check whether
UseCompressedOops is set; assuming yes
14/08/11 08:55:52 INFO MemoryStore: ensureFreeSpace(34220) called with
curMem=0,  maxMem=322122547
14/08/11 08:55:52 INFO MemoryStore: Block broadcast_0 stored as values to
memory  (estimated size 33.4 KB, free 307.2 MB)
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
<console>:12

scala> textFile.count()

14/08/10 08:55:58 ERROR UserGroupInformation: Can't find user in Subject:
Principal: NTUserPrincipal: danfoisy


RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers
I didn’t reply to the last part of your message:

My source is Kafka, kafka already acts as a buffer with a lot of space.

So when I start my spark job, there is a lot of data to catch up (and it is 
critical not to lose any), but the kafka consumer goes as fast as it can (and 
it’s faster than my hdfs’s maximum).

I think the kind of self-regulating system you describe would be too difficult
to implement and probably unreliable (even more so given that we have
multiple slaves).





looking for a definitive RDD.Pipe() example?

2014-08-11 Thread pjv0580
All,

I have been searching the web for a few days looking for a definitive
Spark/Spark Streaming RDD.pipe() example and cannot find one. Would it be
possible to share with the group an example of both the Java/Scala side as
well as the script (e.g. Python) side? Any help or response would be very
much appreciated.

It also might be a good idea to include this code in a working example for
one of the streaming exercises on GitHub, as I see this question a lot on the
web with no answer except "look at the documentation".

Thanks.
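
Pending an official example, a minimal sketch of rdd.pipe(): each element of
a partition is written to the external process's stdin as one line, and every
line the process prints to stdout becomes one element of the resulting
RDD[String]. The script path below is hypothetical:

  // Scala side: pipe each partition through an external script
  val nums = sc.parallelize(1 to 10, numSlices = 2)
  // /path/to/double.py is a made-up script that reads one number per line
  // from stdin and prints number * 2, one result per line, to stdout
  val doubled = nums.pipe("/path/to/double.py")
  doubled.collect().foreach(println)

The script must exist and be executable on every worker node, which is the
part most write-ups leave out.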



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/looking-for-a-definitive-RDD-Pipe-example-tp11890.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: error with pyspark

2014-08-11 Thread Baoqiang Cao
Thanks Davies and Ron!

It indeed was due to ulimit issue. Thanks a lot!
 
Best,
Baoqiang Cao
Blog: http://baoqiang.org
Email: bqcaom...@gmail.com







Spark app slowing down and I'm unable to kill it

2014-08-11 Thread Grzegorz Białek
Hi,

I ran Spark application in local mode with command:
$SPARK_HOME/bin/spark-submit --driver-memory 1g  
with master set to local.

After around 10 minutes of computing it started to slow down significantly:
the next stage took around 50 minutes, and the one after that was only 80%
done after 5 hours, while CPU usage decreased from 160% to almost 0%
(according to the system monitor), where 200% is the max for one core. (The
stage that took more than 5h was saveAsTextFile on a 50MB RDD.)
During that computation my system was significantly slower and less
responsive.

Moreover, when I wanted to interrupt this application I tried ctrl-c and the
current stage was interrupted, but the program didn't exit. When I shut down
my computer, the system didn't want to shut down because killing the Spark
app failed. (It shut down after ctrl-alt-del.) After reboot the system didn't
want to start, and finally I had to reinstall it.

The same thing happened when I killed the next app using kill -9: the system
was corrupted and I had to reinstall it.


When I ran this app on a bit smaller data everything was ok. I have Linux
Mint 17, 8GB RAM, Intel Core i7-3630QM (4x2.4GHz).

Do you have any idea why Spark slows down, or how to properly kill a Spark
app run through spark-submit?

Thanks,
Grzegorz


Re: Spark app slowing down and I'm unable to kill it

2014-08-11 Thread Grzegorz Białek
I'm using Spark 1.0.0





-- 
Grzegorz Białek
Software Engineer


--
M: +48 664 377 399
E: grzegorz.bialek@codilime.com
--

CodiLime Sp. z o.o. - Ltd. company with its registered office in Poland,
01-167 Warsaw, ul. Zawiszy 14/97. Registered by The District Court for the
Capital City of Warsaw, XII Commercial Department of the National Court
Register. Entered into National Court Register under No. KRS 388871.
Tax identification number (NIP) 5272657478. Statistical number (REGON)
142974628.

-
The information in this email is confidential and may be legally
privileged, it may contain information that is confidential in CodiLime Sp.
z o.o. It is intended solely for the addressee. Any access to this email by
third parties is unauthorized. If you are not the intended recipient of
this message, any disclosure, copying, distribution or any action
undertaken or neglected in reliance thereon is prohibited and may result in
your liability for damages.


Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
I got the same exception after the streaming job had been running for a while.
The ERROR message was complaining about a temp file not being found in the
output folder.

14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job
140774430 ms.0
java.io.FileNotFoundException: File
hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
does not exist.
at
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
at
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
at
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
at
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
at
org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
at
org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


On Fri, Jul 25, 2014 at 7:04 PM, Bill Jay 
wrote:

> I just saw another error after my job was run for 2 hours:
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. 
> Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2766)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2674)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>
>   at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
>   at sun.reflect.GeneratedMethodAccessor146.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.Re

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
The exception was thrown out in application master(spark streaming driver)
and the job shut down after this exception.



share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread chutium
Sharing/reusing RDDs is always useful for many use cases; is this possible
by persisting an RDD on Tachyon?

For example, off-heap persisting a named RDD into a given path (instead of
/tmp_spark_tachyon/spark-xxx-xxx-xxx),
or
saveAsParquetFile on Tachyon.

I tried to save a SchemaRDD on Tachyon:

val parquetFile =
sqlContext.parquetFile("hdfs://test01.zala:8020/user/hive/warehouse/parquet_tables.db/some_table/")
parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_1")

but it always errors; the first error message is:

14/08/11 16:19:28 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in
memory on test03.zala:37377 (size: 18.7 KB, free: 16.6 GB)
14/08/11 16:20:06 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 3.0
(TID 35, test04.zala): java.io.IOException:
FailedToCheckpointException(message:Failed to rename
hdfs://test01.zala:8020/tmp/tachyon/workers/140776003/31806/730 to
hdfs://test01.zala:8020/tmp/tachyon/data/730)
tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:112)
tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:168)
tachyon.client.FileOutStream.close(FileOutStream.java:104)
   
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
   
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
   
parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
   
parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:259)
   
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
   
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)



hdfs://test01.zala:8020/tmp/tachyon/
is already chmod'd to 777, and both owner and group are the same as the
spark/tachyon startup user.

Off-heap persist or saving as a normal text file on Tachyon works fine.

CDH 5.1.0, spark 1.1.0 snapshot, tachyon 0.6 snapshot
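
On the "given path instead of /tmp_spark_tachyon/spark-xxx" part of the
question: off-heap persistence goes through the Tachyon block store, whose
location is configurable. A sketch, assuming the spark.tachyonStore.* settings
of that Spark generation; the base directory and input path below are examples:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .setAppName("tachyon-offheap")
    .set("spark.tachyonStore.url", "tachyon://test01.zala:19998")
    .set("spark.tachyonStore.baseDir", "/spark_offheap")  // instead of /tmp_spark_tachyon
  val sc = new SparkContext(conf)

  val rdd = sc.textFile("hdfs://test01.zala:8020/some/input")
  rdd.persist(StorageLevel.OFF_HEAP)  // blocks go under the Tachyon baseDir above
  rdd.count()

This only makes the location predictable; it does not by itself address the
saveAsParquetFile checkpoint failure above.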



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/share-reuse-off-heap-persisted-tachyon-RDD-in-SparkContext-or-saveAsParquetFile-on-tachyon-in-SQLCont-tp11897.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
Bill

Did you get this resolved somehow? Does anyone have any insight into this problem?

Chen



Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
I've also been seeing similar stacktraces on Spark core (not streaming) and
have a theory it's related to spark.speculation being turned on.  Do you
have that enabled by chance?
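
For anyone wanting to test that theory: speculative execution is off by
default and controlled by a single setting, so a quick way to rule it out
(assuming a SparkConf-based setup) is to pin it explicitly:

  import org.apache.spark.SparkConf

  // explicitly disable speculative task re-launch while debugging the
  // FileNotFoundException thrown from the output committer
  val conf = new SparkConf()
    .setAppName("streaming-job")
    .set("spark.speculation", "false")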



Parallelizing a task makes it freeze

2014-08-11 Thread sparkuser2345
I have an array 'dataAll' of key-value pairs where each value is an array of
arrays. I would like to parallelize a task over the elements of 'dataAll' to
the workers. In the dummy example below, the number of elements in 'dataAll'
is 3, but in the real application it would be tens to hundreds.

Without parallelizing dataAll, 'result' is calculated in less than a second: 

import org.jblas.DoubleMatrix  

val nY = 5000
val nX = 400

val dataAll = Array((1, Array.fill(nY)(Array.fill(nX)(1.0))),
(2, Array.fill(nY)(Array.fill(nX)(1.0))),
(3, Array.fill(nY)(Array.fill(nX)(1.0

val w1 = DoubleMatrix.ones(400)

// This finishes in less than a second: 
val result = dataAll.map { dat =>
  val c   = dat._1
  val dataArr = dat._2
  // Map over the Arrays within dataArr: 
  val test = dataArr.map { arr =>
val test2 = new DoubleMatrix(arr.length, 1, arr:_*)
val out = test2.dot(w1)
out
  }
  (c, test)
}

However, when I parallelize dataAll, the same task freezes: 

val dataAllRDD = sc.parallelize(dataAll, 3)

// This doesn't finish in several minutes: 
val result = dataAllRDD.map { dat =>
  val c   = dat._1
  val dataArr = dat._2
  // Map over the Arrays within dataArr: 
  val test = dataArr.map { arr =>
val test2 = new DoubleMatrix(arr.length, 1, arr:_*)
val out = test2.dot(w1)
out
  }
  (c, test)
}.collect

After sending the above task, nothing is written to the worker logs (as
viewed through the web UI), but the following output is printed in the Spark
shell where I'm running the task: 

14/08/11 18:17:31 INFO SparkContext: Starting job: collect at <console>:33
14/08/11 18:17:31 INFO DAGScheduler: Got job 0 (collect at <console>:33)
with 3 output partitions (allowLocal=false)
14/08/11 18:17:31 INFO DAGScheduler: Final stage: Stage 0 (collect at
<console>:33)
14/08/11 18:17:31 INFO DAGScheduler: Parents of final stage: List()
14/08/11 18:17:31 INFO DAGScheduler: Missing parents: List()
14/08/11 18:17:31 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map
at <console>:23), which has no missing parents
14/08/11 18:17:32 INFO DAGScheduler: Submitting 3 missing tasks from Stage 0
(MappedRDD[1] at map at <console>:23)
14/08/11 18:17:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
14/08/11 18:17:32 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on
executor 2:  (PROCESS_LOCAL)
14/08/11 18:17:32 INFO TaskSetManager: Serialized task 0.0:0 as 16154060
bytes in 69 ms
14/08/11 18:17:32 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on
executor 1:  (PROCESS_LOCAL)
14/08/11 18:17:32 INFO TaskSetManager: Serialized task 0.0:1 as 16154060
bytes in 81 ms
14/08/11 18:17:32 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on
executor 0:  (PROCESS_LOCAL)
14/08/11 18:17:32 INFO TaskSetManager: Serialized task 0.0:2 as 16154060
bytes in 66 ms


dataAllRDD.map does work with a smaller array, though (e.g. nY = 100; it finishes
in less than a second).

Why is dataAllRDD.map so much slower than dataAll.map, or even not executing
at all?

The Spark version I'm using is 0.9.0. 
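
One thing that stands out in the log above is that every serialized task is roughly
16 MB: sc.parallelize embeds each partition's data in the task itself, and 0.9.x
ships tasks over Akka, whose default frame size is (if I remember correctly) only
about 10 MB. Purely as a hedged, untested sketch -- not a confirmed diagnosis --
one way to keep the tasks small is to broadcast the large arrays once and
parallelize only the keys, reusing sc, dataAll and w1 from the snippet above:

// Sketch only: ship the big arrays to the workers once via a broadcast
// variable instead of embedding ~16 MB of data in every serialized task.
val dataMap   = dataAll.toMap            // Map[Int, Array[Array[Double]]]
val dataBcast = sc.broadcast(dataMap)

val result = sc.parallelize(dataAll.map(_._1), 3).map { c =>
  val dataArr = dataBcast.value(c)       // looked up on the worker side
  val test = dataArr.map { arr =>
    new DoubleMatrix(arr.length, 1, arr: _*).dot(w1)
  }
  (c, test)
}.collect

Raising spark.akka.frameSize (in MB) above the observed task size might also be
worth trying, but again, that is speculation.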



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parallelizing-a-task-makes-it-freeze-tp11900.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can I share the RDD between multiprocess

2014-08-11 Thread coolfrood
Reviving this discussion again...

I'm interested in using Spark as the engine for a web service.

The SparkContext and its RDDs only exist in the JVM that started it.  While
RDDs are resilient, the context owner isn't, so I may
be able to serve requests out of a single "service" JVM, but I'll lose all
my RDDs if the service dies.

It's possible to share RDDs by writing them into Tachyon, but with that I'll
end up having at least 2 copies of the same data in memory; even more if I
access the data from multiple contexts.

Is there a way around this?
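
For context, a minimal sketch of the single "service JVM" pattern described above
(app name, path and column layout are made up): one long-lived SparkContext owns
the cached RDDs and every web request reuses them. It does nothing for the
resilience concern -- if this JVM dies, the cached RDDs die with it.

import org.apache.spark.{SparkConf, SparkContext}

object RddService {
  // Master URL is assumed to come from the environment / spark-submit.
  private val sc = new SparkContext(new SparkConf().setAppName("rdd-service"))

  // Cached once; shared by all requests handled in this JVM.
  private val users = sc.textFile("hdfs:///data/users.csv")   // placeholder path
    .map(_.split(","))
    .cache()

  // Called by the web layer for each incoming request.
  def countByCountry(country: String): Long =
    users.filter(fields => fields.length > 2 && fields(2) == country).count()
}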



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-I-share-the-RDD-between-multiprocess-tp916p11901.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ClassNotFound for user class in uber-jar

2014-08-11 Thread lbustelo
I've seen this same exact problem too and I've been ignoring it, but I wonder if
I'm losing data. Can anyone at least comment on this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFound-for-user-class-in-uber-jar-tp10613p11902.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ClassNotFound exception on class in uber.jar

2014-08-11 Thread lbustelo
Not sure if this problem reached the Spark guys because it shows in Nabble
that "This post has NOT been accepted by the mailing list yet".

http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFound-for-user-class-in-uber-jar-td10613.html#a11902

I'm resubmitting. 

Greetings, 

I'm currently building a "fat" or "uber" jar with dependencies using Maven,
which a docker-ized Spark cluster (1 master, 3 workers, version 1.0.0, Scala
2.10.4) points to locally on the same VM. It seems that sometimes a
particular class is found, and things are fine, and other times it is not.
Doing a find on the jar confirms that the class is actually there. I `setJars` with
JavaStreamingContext.jarOfClass(the main class).
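
For reference, a rough sketch of how that registration might look (shown with the
Scala streaming API and placeholder master URL, app name and jar path -- the
actual project uses the Java streaming API and jarOfClass instead of a literal path):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: register the uber-jar with the context so executors can fetch it.
val conf = new SparkConf()
  .setMaster("spark://master:7077")                            // placeholder
  .setAppName("MyStreamingApp")                                // placeholder
  .setJars(Seq("/opt/app/my-app-jar-with-dependencies.jar"))   // placeholder
val ssc = new StreamingContext(conf, Seconds(10))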

I cannot say I know much about how the classpath mechanisms of Spark work, so I
appreciate any and all suggestions to find out what exactly is happening.

The exception is as follows: 

14/07/24 18:48:52 INFO Executor: Sending result for 139 directly to driver
14/07/24 18:48:52 INFO Executor: Finished task ID 139
14/07/24 18:48:56 WARN BlockManager: Putting block input-0-1406227713800 failed
14/07/24 18:48:56 ERROR BlockManagerWorker: Exception handling buffer message
java.lang.ClassNotFoundException: com.cjm5325.MyProject.MyClass
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
Andrew, that is a good finding.

Yes, I have speculative execution turned on, because I saw tasks stalled on
the HDFS client.

If I turn off speculative execution, is there a way to circumvent the
hanging task issue?
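
(For reference, a hedged sketch of the speculation knobs as I understand them from
the configuration docs; the property names are the documented ones, but the values
below are purely illustrative. Rather than disabling speculation outright, making it
less aggressive may be an alternative:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // "false" disables speculative execution entirely
  .set("spark.speculation.interval", "1000")   // ms between checks for stragglers (illustrative)
  .set("spark.speculation.quantile", "0.90")   // fraction of tasks that must finish first (illustrative)
  .set("spark.speculation.multiplier", "2")    // how much slower than the median before speculating (illustrative)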



On Mon, Aug 11, 2014 at 11:13 AM, Andrew Ash  wrote:

> I've also been seeing similar stacktraces on Spark core (not streaming)
> and have a theory it's related to spark.speculation being turned on.  Do
> you have that enabled by chance?
>
>
> On Mon, Aug 11, 2014 at 8:10 AM, Chen Song  wrote:
>
>> Bill
>>
>> Did you get this resolved somehow? Anyone has any insight into this
>> problem?
>>
>> Chen
>>
>>
>> On Mon, Aug 11, 2014 at 10:30 AM, Chen Song 
>> wrote:
>>
>>> The exception was thrown out in the application master (spark streaming
>>> driver) and the job shut down after this exception.
>>>
>>>
>>> On Mon, Aug 11, 2014 at 10:29 AM, Chen Song 
>>> wrote:
>>>
 I got the same exception after the streaming job had been running for a while. The
 ERROR message was complaining about a temp file not being found in the
 output folder.

 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job
 140774430 ms.0
 java.io.FileNotFoundException: File
 hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
 does not exist.
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
 at
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
 at
 org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
 at
 org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
 at
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
 at
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
 at
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
 at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 On Fri, Jul 25, 2014 at 7:04 PM, Bill Jay 
 wrote:

> I just saw another error after my job was run for 2 hours:
>
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not 
> exist. Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open 
> files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2766)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2674)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Cheng Lian
Since you were using hql(...), it’s probably not related to the JDBC driver.
But I failed to reproduce this issue locally with a single-node pseudo-distributed
YARN cluster. Would you mind elaborating more on the steps to
reproduce this bug? Thanks


On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian  wrote:

> Hi Jenny, does this issue only happen when running Spark SQL with YARN in
> your environment?
>
>
> On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao  wrote:
>
>>
>> Hi,
>>
>> I am able to run my hql query on yarn cluster mode when connecting to the
>> default hive metastore defined in hive-site.xml.
>>
>> however, if I want to switch to a different database, like:
>>
>>   hql("use other-database")
>>
>>
>> it only works in yarn client mode, but failed on yarn-cluster mode with
>> the following stack:
>>
>> 14/08/08 12:09:11 INFO HiveMetaStore: 0: get_database: tt
>> 14/08/08 12:09:11 INFO audit: ugi=biadminip=unknown-ip-addr  
>> cmd=get_database: tt
>> 14/08/08 12:09:11 ERROR RetryingHMSHandler: 
>> NoSuchObjectException(message:There is no database named tt)
>>  at 
>> org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
>>  at 
>> org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>  at java.lang.reflect.Method.invoke(Method.java:611)
>>  at 
>> org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
>>  at $Proxy15.getDatabase(Unknown Source)
>>  at 
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>  at java.lang.reflect.Method.invoke(Method.java:611)
>>  at 
>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
>>  at $Proxy17.get_database(Unknown Source)
>>  at 
>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>  at java.lang.reflect.Method.invoke(Method.java:611)
>>  at 
>> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
>>  at $Proxy18.getDatabase(Unknown Source)
>>  at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
>>  at 
>> org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
>>  at 
>> org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
>>  at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
>>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
>>  at 
>> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
>>  at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
>>  at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
>>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
>>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
>>  at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:208)
>>  at 
>> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:182)
>>  at 
>> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:272)
>>  at 
>> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:269)
>>  at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:86)
>>  at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:91)
>>  at 
>> org.apache.spark.examples.sql.hive.HiveSpark$.main(HiveSpark.scala:35)
>>  at org.apache.spark.examples.sql.hive.HiveSpark.main(HiveSpark.scala)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>  at java.lang.reflect.Method.invoke(Method.java:611)
>>  at 
>> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
>>
>> 14/08/08 12:09:11 ERROR DDLTask: 
>> org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: tt
>>  at 
>> org.apache.hadoop.hive.ql.exec.DDLTask.switchD

Re: Spark SQL JDBC

2014-08-11 Thread Cheng Lian
Hi John, the JDBC Thrift server resides in its own build profile and needs
to be enabled explicitly by building with ./sbt/sbt -Phive-thriftserver assembly.


On Tue, Aug 5, 2014 at 4:54 AM, John Omernik  wrote:

> I am using spark-1.1.0-SNAPSHOT right now and trying to get familiar with
> the JDBC thrift server.  I have everything compiled correctly, I can access
> data in spark-shell on yarn from my hive installation. Cached tables, etc
> all work.
>
> When I execute ./sbin/start-thriftserver.sh
>
> I get the error below. Shouldn't it just read my spark-env? I guess I am
> lost on how to make this work.
>
> Thanks!
>
> $ ./start-thriftserver.sh
>
>
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
>
> Exception in thread "main" java.lang.ClassNotFoundException:
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:270)
>
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:311)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>


Re: increase parallelism of reading from hdfs

2014-08-11 Thread Paul Hamilton
Hi Chen,

You need to set the max input split size so that the underlying hadoop
libraries will calculate the splits appropriately.  I have done the
following successfully:

val job = new Job()
FileInputFormat.setMaxInputSplitSize(job, 12800L)

And then use job.getConfiguration when creating a NewHadoopRDD.

I am sure there is some way to use it with convenience methods like
SparkContext.textFile; you could probably set the Hadoop property
"mapreduce.input.fileinputformat.split.maxsize".

Regards,
Paul Hamilton

From:  Chen Song 
Date:  Friday, August 8, 2014 at 9:13 PM
To:  "user@spark.apache.org" 
Subject:  increase parallelism of reading from hdfs


In Spark Streaming, StreamingContext.fileStream gives a FileInputDStream.
Within each batch interval, it launches map tasks for the new files
detected during that interval. It appears that the way Spark computes the
number of map tasks is based on the block size of the files.

Below is the quote from the Spark documentation.

 Spark automatically sets the number of "map" tasks to run on each file
according to its size (though you can control
 it through optional parameters to SparkContext.textFile,
 etc)

In my testing, if files are loaded as 512M blocks, each map task seems to
process a 512M chunk of data, no matter what value I set for dfs.blocksize on
the driver/executor. I am wondering if there is a way to increase parallelism,
say, let each map task read 128M of data and thereby increase the number of
map tasks?


-- 
Chen Song


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running a task once on each executor

2014-08-11 Thread RodrigoB
Hi Christopher,

I am also in need of a single function call at the node level.
Your suggestion makes sense as a solution to the requirement, but it still
feels like a workaround: this check will get called on every row. Also,
having static members and methods, especially in a multi-threaded
environment, is a bad code smell.

It would be nice to have a way of exposing the nodes that would
allow simply invoking a function from the driver on the nodes, without having
to do any transformation and loop through every record. That would be more
efficient and more flexible from a user's perspective.
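
For what it's worth, the per-JVM lazy-initialization workaround usually looks
something like the sketch below (all names are made up). It still piggybacks on a
transformation, so it does not remove the coupling described above, but the setup
runs at most once per worker JVM rather than once per row:

import org.apache.spark.SparkContext

object NodeSetup {
  // Placeholder node-level setup; the lazy val ensures it runs at most once per JVM.
  lazy val ready: Boolean = {
    println("node-level setup running once in this worker JVM")
    true
  }
}

def runWithNodeSetup(sc: SparkContext): Array[Int] = {
  val rdd = sc.parallelize(1 to 100, 4)
  rdd.mapPartitions { iter =>
    NodeSetup.ready        // evaluated lazily: once per JVM, not once per record
    iter.map(_ * 2)        // normal per-record processing
  }.collect()
}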

Tnks,
Rod





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p11908.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Random Forest implementation in MLib

2014-08-11 Thread Sameer Tilak
Hi All, I read on the mailing list that a random forest implementation was on the
roadmap. I wanted to check on its status. We are currently using Weka and
would like to move over to MLlib for performance.

Re: Can I share the RDD between multiprocess

2014-08-11 Thread Ruchir Jha
Look at: https://github.com/ooyala/spark-jobserver


On Mon, Aug 11, 2014 at 11:48 AM, coolfrood  wrote:

> Reviving this discussion again...
>
> I'm interested in using Spark as the engine for a web service.
>
> The SparkContext and its RDDs only exist in the JVM that started it.  While
> RDDs are resilient, this means the context owner isn't resilient, so I may
> be able to serve requests out of a single "service" JVM, but I'll lose all
> my RDDs if the service dies.
>
> It's possible to share RDDs by writing them into Tachyon, but with that
> I'll
> end up having at least 2 copies of the same data in memory; even more if I
> access the data from multiple contexts.
>
> Is there a way around this?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-I-share-the-RDD-between-multiprocess-tp916p11901.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [MLLib]:choosing the Loss function

2014-08-11 Thread SK
Hi,

Thanks for the reference to the LBFGS optimizer. 
I tried to use the LBFGS optimizer, but I am not able to pass it  as an
input to the LogisticRegression model for binary classification. After
studying the code in mllib/classification/LogisticRegression.scala, it
appears that the only implementation of LogisticRegression uses
GradientDescent as a fixed optimizer. In other words, I don't see a
setOptimizer() function that I can use to change the optimizer to LBFGS.

I tried to follow the code in
https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
that makes use of LBFGS, but it is not clear to me where  the
LogisticRegression  model with LBFGS is being returned that I can use for
the classification of the test dataset. 

If someone has sample code that uses LogisticRegression with LBFGS instead
of GradientDescent as the optimization algorithm, it would be helpful if you
could post it.

thanks 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738p11913.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



mllib style

2014-08-11 Thread Koert Kuipers
i was just looking at ALS
(mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala)

is there any need for all the variables to be vars and to have all these setters
around? it just leads to so much clutter

if you really want them to be vars it is safe in scala to make them public
(scala creates setters and getters for you anyhow, so you can change a var
to a pair of methods later if you want different behavior, without breaking APIs).

but a more scala-friendly style would be vals and no setters...
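
a small illustration of that point (class and field names below are made up, not
the actual ALS code): a public var and an explicit getter/setter pair expose the
same interface to callers, so one can later be swapped for the other without
breaking source compatibility.

// version 1: a plain public var; scala generates the blocks / blocks_= accessors
class AlsConfigV1 {
  var blocks: Int = -1
}

// version 2: same interface for callers, but with validation behind the setter
class AlsConfigV2 {
  private var _blocks: Int = -1
  def blocks: Int = _blocks
  def blocks_=(value: Int): Unit = {
    require(value == -1 || value > 0, "blocks must be positive, or -1 for auto")
    _blocks = value
  }
}

// caller code is identical either way:
//   val conf = new AlsConfigV2; conf.blocks = 8; println(conf.blocks)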

best, koert


Failed jobs show up as succeeded in YARN?

2014-08-11 Thread Shay Rojansky
Spark 1.0.2, Python, Cloudera 5.1 (Hadoop 2.3.0)

It seems that Python jobs I'm sending to YARN show up as succeeded even if
they failed... Am I doing something wrong, or is this a known issue?

Thanks,

Shay


spark.files.userClassPathFirst=true Not Working Correctly

2014-08-11 Thread DNoteboom
Currently my code uses commons-pool version 1.6 but Spark uses commons-pool
version 1.54. This causes an error when I try to access a method that is
visible in 1.6 but not in 1.54. I tried to fix this by setting
userClassPathFirst=true (and I verified that this was set correctly in
http://:4040 and that my jars are listed in the user jars). The
problem did not go away, which means that Spark is not functioning correctly.
I added the commons-pool 1.6 jar in front of CLASSPATH in the
bin/compute-classpath.sh file and this got rid of my problem, but this is a
hacky fix and not a long-term solution.
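
For reference, a sketch of how I am supplying the setting and the jar
programmatically (app name and jar path below are placeholders; as a later reply
notes, standalone and YARN appear to use different options for this):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: ship the application jar and ask executors to prefer user classes.
val conf = new SparkConf()
  .setAppName("commons-pool-app")                           // placeholder
  .setJars(Seq("/path/to/app-with-commons-pool-1.6.jar"))   // placeholder
  .set("spark.files.userClassPathFirst", "true")
val sc = new SparkContext(conf)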

I have looked through the Spark source code and it appears to check the
URLClassLoader for the user classpath first. I have tried to determine how
the user jars are being added to the classloader's list of URLs, with
little success. At this point I was trying to debug by installing spark-core
so I could edit the source code and then inject the modified .class files
into the spark-assembly jar that ships with spark-hadoop, again with little success.
Does anyone know why this doesn't work, or have any solutions for this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-files-userClassPathFirst-true-Not-Working-Correctly-tp11917.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-11 Thread Marcelo Vanzin
Could you share which cluster manager you're using and exactly
where the error shows up (driver or executor)?

A quick look reveals that Standalone and YARN use different options to
control this, for example. (Maybe that should already be filed as a bug.)

On Mon, Aug 11, 2014 at 12:24 PM, DNoteboom  wrote:
> Currently my code uses commons-pool version 1.6 but Spark uses commons-pool
> version 1.54. This causes an error when I try to access a method that is
> visible in 1.6 but not in 1.54. I tried to fix this by setting the
> userClassPathFirst=true(and I verified that this was set correctly in
> http://:4040 and that my jars are listed in the user-jars). The
> problem did not go away which means that Spark is not functioning correctly.
> I added commons-pool-version 1.6 in front of CLASS_PATH in the
> bin/compute_classpath.sh file and this got rid of my problem, but this is a
> hacky fix and not a long term solution.
>
> I have looked through the Spark source code and it appears to check the
> URLClassLoader for the user classpath first. I have tried to determine how
> the user jars are being added to the list of URLs to the classloader with
> little success. At this point I was trying to debug by installing core-spark
> so I can edit the source code and then injecting the modified .class files
> into the Spark-assembly-jar that is in spark-hadoop with little success.
> Does anyone know why this doesn't work, or have any solutions for this?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-files-userClassPathFirst-true-Not-Working-Correctly-tp11917.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Compile spark code with idea succesful but run SparkPi error with "java.lang.SecurityException"

2014-08-11 Thread Ron's Yahoo!
Not sure what your environment is, but this happened to me before because I had
a couple of servlet-api jars in the path which did not match.
I was building a system that programmatically submitted jobs, so I had my own
jars that conflicted with those of Spark. The solution is to run mvn dependency:tree
from your app, see what jars you have, and make sure to exclude the conflicting ones.

Thanks,
Ron

On Aug 11, 2014, at 6:36 AM, Zhanfeng Huo  wrote:

> Hi,
> 
> I have compiled the spark-1.0.1 code with IntelliJ IDEA 13.1.4 on Ubuntu
> 14.04 successfully, but when I run the SparkPi example in local mode it fails.
> 
> I have set the env "export SPARK_HADOOP_VERSION=2.3.0 and export
> SPARK_YARN=true" before I start IDEA.
> 
> I have attempted to use the patch at
> https://github.com/apache/spark/pull/1271/files, but it has no effect. How
> can I solve this problem?
> 
>  
>The full message:
> 
> 14/08/11 22:15:56 INFO Utils: Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties 
> 14/08/11 22:15:56 WARN Utils: Your hostname, syn resolves to a loopback 
> address: 127.0.1.1; using 192.168.159.132 instead (on interface eth0) 
> 14/08/11 22:15:56 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address 
> 14/08/11 22:15:56 INFO SecurityManager: Changing view acls to: syn 
> 14/08/11 22:15:56 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(syn) 
> 14/08/11 22:15:57 INFO Slf4jLogger: Slf4jLogger started 
> 14/08/11 22:15:57 INFO Remoting: Starting remoting 
> 14/08/11 22:15:57 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://spark@192.168.159.132:50914] 
> 14/08/11 22:15:57 INFO Remoting: Remoting now listens on addresses: 
> [akka.tcp://spark@192.168.159.132:50914]
> 14/08/11 22:15:57 INFO SparkEnv: Registering MapOutputTracker 
> 14/08/11 22:15:57 INFO SparkEnv: Registering BlockManagerMaster 
> 14/08/11 22:15:57 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-local-20140811221557-dd19 
> 14/08/11 22:15:57 INFO MemoryStore: MemoryStore started with capacity 804.3 
> MB. 
> 14/08/11 22:15:57 INFO ConnectionManager: Bound socket to port 56061 with id 
> = ConnectionManagerId(192.168.159.132,56061) 
> 14/08/11 22:15:57 INFO BlockManagerMaster: Trying to register BlockManager 
> 14/08/11 22:15:57 INFO BlockManagerInfo: Registering block manager 
> 192.168.159.132:56061 with 804.3 MB RAM 
> 14/08/11 22:15:57 INFO BlockManagerMaster: Registered BlockManager 
> 14/08/11 22:15:57 INFO HttpServer: Starting HTTP Server 
> 14/08/11 22:15:57 INFO HttpBroadcast: Broadcast server started at 
> http://192.168.159.132:39676 
> 14/08/11 22:15:57 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-f8474345-0dcd-41c4-9247-3e916d409b27 
> 14/08/11 22:15:57 INFO HttpServer: Starting HTTP Server 
> Exception in thread "main" java.lang.SecurityException: class 
> "javax.servlet.FilterRegistration"'s signer information does not match signer 
> information of other classes in the same package 
> at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) 
> at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) 
> at java.lang.ClassLoader.defineClass(ClassLoader.java:794) 
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) 
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71) 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361) 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354) 
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425) 
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) 
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358) 
> at 
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:136)
>  
> at 
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:129)
>  
> at 
> org.eclipse.jetty.servlet.ServletContextHandler.(ServletContextHandler.java:98)
>  
> at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:98) 
> at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:89) 
> at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:64) 
> at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:57) 
> at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:57) 
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
> at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:57) 
> at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:66) 
> at org.apache.spark.ui.SparkUI.(SparkUI.scala:60) 
> at org.apache.spark.ui.SparkUI.(SparkUI.scala:42
> at org.apache.spark.SparkContext.(SparkContext.scala:

Re: increase parallelism of reading from hdfs

2014-08-11 Thread Chen Song
Thanks Paul. I will give it a try.


On Mon, Aug 11, 2014 at 1:11 PM, Paul Hamilton 
wrote:

> Hi Chen,
>
> You need to set the max input split size so that the underlying hadoop
> libraries will calculate the splits appropriately.  I have done the
> following successfully:
>
> val job = new Job()
> FileInputFormat.setMaxInputSplitSize(job, 12800L)
>
> And then use job.getConfiguration when creating a NewHadoopRDD.
>
> I am sure there is some way to use it with convenience methods like
> SparkContext.textFile, you could probably set the system property
> "mapreduce.input.fileinputformat.split.maxsize".
>
> Regards,
> Paul Hamilton
>
> From:  Chen Song 
> Date:  Friday, August 8, 2014 at 9:13 PM
> To:  "user@spark.apache.org" 
> Subject:  increase parallelism of reading from hdfs
>
>
> In Spark Streaming, StreamContext.fileStream gives a FileInputDStream.
> Within each batch interval, it would launch map tasks for the new files
> detected during that interval. It appears that the way Spark computes the
> number of map tasks is based
>  on the block size of the files.
>
> Below is the quote from Spark documentation.
>
>  Spark automatically sets the number of "map" tasks to run on each file
> according to its size (though you can control
>  it through optional parameters to SparkContext.textFile,
>  etc)
>
> In my testing, if files are loaded as 512M blocks, each map task seems to
> process 512M chunk of data, no matter what value I set dfs.blocksize on
> driver/executor. I am wondering if there is a way to increase parallelism,
> say let each map read 128M data
>  and increase the number of map tasks?
>
>
> --
> Chen Song
>
>


-- 
Chen Song


RE: Spark on an HPC setup

2014-08-11 Thread Sidharth Kashyap
Hi Jeremy,
Thanks for the reply.
We got Spark running on our setup after adapting a similar script to work with
LSF.
Really appreciate your help.
Will keep in touch on Twitter.
Thanks, @sidkashyap :)

From: freeman.jer...@gmail.com
Subject: Re: Spark on an HPC setup
Date: Thu, 29 May 2014 00:37:54 -0400
To: user@spark.apache.org

Hi Sid,
We are successfully running Spark on an HPC, it works great. Here's info on our 
setup / approach.
We have a cluster with 256 nodes running Scientific Linux 6.3 and scheduled by 
Univa Grid Engine.  The environment also has a DDN GridScalar running GPFS and 
several EMC Isilon clusters serving NFS to the compute cluster.
We wrote a custom qsub job to spin up Spark dynamically on a user-designated 
quantity of nodes. The UGE scheduler first designates a set of nodes that will 
be used to run Spark. Once the nodes are available, we use start-master.sh 
script to launch a master, and send it the addresses of the other nodes. The 
master then starts the workers with start-all.sh. At that point, the Spark 
cluster is usable and remains active until the user issues a qdel, which 
triggers the stop-all.sh on the master, and takes down the cluster. 
This worked well for us because users can pick the number of nodes to suit 
their job, and multiple users can run their own Spark clusters on the same 
system (alongside other non-Spark jobs).
We don't use HDFS for the filesystem, instead relying on NFS and GPFS, and the 
cluster is not running Hadoop. In tests, we've seen similar performance between 
our set up, and using Spark w/ HDFS on EC2 with higher-end instances (matched 
roughly for memory and number of cores).
Unfortunately we can't open source the launch scripts because they contain
proprietary UGE stuff, but we're happy to try and answer any follow-up questions.
-- Jeremy

-
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab



On May 28, 2014, at 11:02 AM, Sidharth Kashyap wrote:

Hi,
Has anyone tried to get Spark working on an HPC setup? If yes, can you please
share your learnings and how you went about doing it?
An HPC setup typically comes bundled with a dynamically allocated cluster and a
very efficient scheduler.
Configuring Spark standalone in this mode of operation is challenging as the
Hadoop dependencies need to be eliminated and the cluster needs to be
configured on the fly.
Thanks,
Sid


  

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-11 Thread DNoteboom
I'm currently running on my local machine in standalone mode. The error shows up
in my code when I am closing resources using
TaskContext.addOnCompleteCallback. However, the cause of this error is
a faulty classloader, which must be created in the Executor's
createClassLoader function.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-files-userClassPathFirst-true-Not-Working-Correctly-tp11917p11921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Gathering Information about Standalone Cluster

2014-08-11 Thread Wonha Ryu
Hey all,

Is there any kind of API to access the information about resources, executors,
and applications in a standalone cluster that is displayed in the web UI?
Currently I'm using 1.0.x, but I'm interested in experimenting with the bleeding
edge.

Thanks,
Wonha


Re: Random Forest implementation in MLib

2014-08-11 Thread DB Tsai
We have an open-sourced Random Forest at Alpine Data Labs with the Apache
license. We're also trying to have it merged into Spark MLlib now.

https://github.com/AlpineNow/alpineml

It's been tested a lot, and the accuracy and training time benchmark is
great. There could be some bugs here and there, so we're looking forward to
your feedback, and please let us know what you think.

We'll continue to improve it and we'll be adding Gradient Boosting in the
near future as well.

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Aug 11, 2014 at 10:52 AM, Sameer Tilak  wrote:

> Hi All,
> I read on the mailing list that random forest implementation was on the
> roadmap. I wanted to check about its status? We are currently using Weka
> and would like to move over to MLib for performance.
>


Re: [MLLib]:choosing the Loss function

2014-08-11 Thread Burak Yavuz
Hi,

// Initialize the optimizer using logistic regression as the loss function with 
L2 regularization
val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())

// Set the hyperparameters
lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor)

// Retrieve the weights
val weightsWithIntercept = lbfgs.optimize(data, initialWeights)

//Slice weights with intercept into weight and intercept

//Initialize Logistic Regression Model
val model = new LogisticRegressionModel(weights, intercept)

model.predict(test) //Make your predictions

The example code doesn't generate the Logistic Regression Model that you can 
make predictions with.

`LBFGS.runMiniBatchLBFGS` outputs a tuple of (weights, lossHistory). The
example code was for a benchmark, so it was more interested
in the loss history than the model itself.

You can also run
`val (weightsWithIntercept, localLoss) = LBFGS.runMiniBatchLBFGS ...`

slice `weightsWithIntercept` into the intercept and the rest of the weights and 
instantiate the model again as:
val model = new LogisticRegressionModel(weights, intercept)
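
To make the slicing step concrete, one way it could look is the sketch below. It
assumes the intercept was appended as the last element of the (bias-augmented)
weight vector, as the benchmark setup does, and that `test` is an RDD[LabeledPoint]:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Split the combined vector back into the weights and the intercept.
val raw       = weightsWithIntercept.toArray
val intercept = raw.last
val weights   = Vectors.dense(raw.init)

val model = new LogisticRegressionModel(weights, intercept)
val predictions = model.predict(test.map(_.features))   // assumes test: RDD[LabeledPoint]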


Burak



- Original Message -
From: "SK" 
To: u...@spark.incubator.apache.org
Sent: Monday, August 11, 2014 11:52:04 AM
Subject: Re: [MLLib]:choosing the Loss function

Hi,

Thanks for the reference to the LBFGS optimizer. 
I tried to use the LBFGS optimizer, but I am not able to pass it  as an
input to the LogisticRegression model for binary classification. After
studying the code in mllib/classification/LogisticRegression.scala, it
appears that the  only implementation of LogisticRegression uses
GradientDescent as a fixed optimizer. In other words, I dont see a
setOptimizer() function that I can use to change the optimizer to LBFGS.

I tried to follow the code in
https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
that makes use of LBFGS, but it is not clear to me where  the
LogisticRegression  model with LBFGS is being returned that I can use for
the classification of the test dataset. 

If some one has sample code that uses LogisticRegression with LBFGS instead
of gradientDescent as the optimization algorithm, it would be helpful if you
can post it.

thanks 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738p11913.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Job ACL's on SPark

2014-08-11 Thread Manoj kumar
Hi Friends,

Any response on this? I looked into the documentation but could not find any
information.

--Manoj



On Fri, Aug 8, 2014 at 6:56 AM, Manoj kumar 
wrote:

> Hi Team,
>
>
>
> Do we have Job ACLs for Spark similar to Hadoop Job ACLs,
>
>
> where I can restrict who can submit jobs to the Spark Master service?
>
> In our Hadoop cluster we enabled Job ACLs by using job queues, restricting
> the default queues, and using the Fair Scheduler for managing the
> workloads on the platform.
>
>
> Do we have similar functionality in Spark? I have seen some references to
> fair scheduler pools but could not find much about job queues.
>
>
> Please advise :)
>
>
> --Manoj
>


Re: [MLLib]:choosing the Loss function

2014-08-11 Thread DB Tsai
Hi SK,

I'm working on a PR that adds a logistic regression interface with LBFGS.
It will be merged in the Spark 1.1 release, I hope.
https://github.com/apache/spark/pull/1862

Before it is merged, you can just copy the code into your application to use it.
I'm also working on another PR which automatically rescales the training set
to improve the condition number of the optimization process. After
training, the scaling factors will be folded back into the weights so the
whole thing is transparent to users. Libsvm and glmnet do this to deal with
datasets that have huge variance in some columns.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Aug 11, 2014 at 2:21 PM, Burak Yavuz  wrote:

> Hi,
>
> // Initialize the optimizer using logistic regression as the loss function
> with L2 regularization
> val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
>
> // Set the hyperparameters
>
> lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor)
>
> // Retrieve the weights
> val weightsWithIntercept = lbfgs.optimize(data, initialWeights)
>
> //Slice weights with intercept into weight and intercept
>
> //Initialize Logistic Regression Model
> val model = new LogisticRegressionModel(weights, intercept)
>
> model.predict(test) //Make your predictions
>
> The example code doesn't generate the Logistic Regression Model that you
> can make predictions with.
>
> `LBFGS.runMiniBatchLBFGS` outputs a tuple of (weights, lossHistory). The
> example code was for a benchmark, so it was interested more
> in the loss history than the model itself.
>
> You can also run
> `val (weightsWithIntercept, localLoss) = LBFGS.runMiniBatchLBFGS ...`
>
> slice `weightsWithIntercept` into the intercept and the rest of the
> weights and instantiate the model again as:
> val model = new LogisticRegressionModel(weights, intercept)
>
>
> Burak
>
>
>
> - Original Message -
> From: "SK" 
> To: u...@spark.incubator.apache.org
> Sent: Monday, August 11, 2014 11:52:04 AM
> Subject: Re: [MLLib]:choosing the Loss function
>
> Hi,
>
> Thanks for the reference to the LBFGS optimizer.
> I tried to use the LBFGS optimizer, but I am not able to pass it  as an
> input to the LogisticRegression model for binary classification. After
> studying the code in mllib/classification/LogisticRegression.scala, it
> appears that the  only implementation of LogisticRegression uses
> GradientDescent as a fixed optimizer. In other words, I dont see a
> setOptimizer() function that I can use to change the optimizer to LBFGS.
>
> I tried to follow the code in
>
> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
> that makes use of LBFGS, but it is not clear to me where  the
> LogisticRegression  model with LBFGS is being returned that I can use for
> the classification of the test dataset.
>
> If some one has sample code that uses LogisticRegression with LBFGS instead
> of gradientDescent as the optimization algorithm, it would be helpful if
> you
> can post it.
>
> thanks
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738p11913.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Jenny Zhao
you can reproduce this issue with the following steps (assuming you have a
YARN cluster + Hive 12):

1) using hive shell, create a database, e.g: create database ttt

2) write a simple spark sql program

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext

object HiveSpark {
  case class Record(key: Int, value: String)

  def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("HiveSpark")
val sc = new SparkContext(sparkConf)

// A hive context creates an instance of the Hive Metastore in process,
val hiveContext = new HiveContext(sc)
import hiveContext._

hql("use ttt")
hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA INPATH '/user/biadmin/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
println("Result of 'SELECT *': ")
hql("SELECT * FROM src").collect.foreach(println)
sc.stop()
  }
}
3) run it in yarn-cluster mode.


On Mon, Aug 11, 2014 at 9:44 AM, Cheng Lian  wrote:

> Since you were using hql(...), it’s probably not related to JDBC driver.
> But I failed to reproduce this issue locally with a single node pseudo
> distributed YARN cluster. Would you mind to elaborate more about steps to
> reproduce this bug? Thanks
> ​
>
>
> On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian  wrote:
>
>> Hi Jenny, does this issue only happen when running Spark SQL with YARN in
>> your environment?
>>
>>
>> On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> I am able to run my hql query on yarn cluster mode when connecting to
>>> the default hive metastore defined in hive-site.xml.
>>>
>>> however, if I want to switch to a different database, like:
>>>
>>>   hql("use other-database")
>>>
>>>
>>> it only works in yarn client mode, but failed on yarn-cluster mode with
>>> the following stack:
>>>
>>> 14/08/08 12:09:11 INFO HiveMetaStore: 0: get_database: tt
>>> 14/08/08 12:09:11 INFO audit: ugi=biadmin   ip=unknown-ip-addr  
>>> cmd=get_database: tt
>>> 14/08/08 12:09:11 ERROR RetryingHMSHandler: 
>>> NoSuchObjectException(message:There is no database named tt)
>>> at 
>>> org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
>>> at 
>>> org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
>>> at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>> at java.lang.reflect.Method.invoke(Method.java:611)
>>> at 
>>> org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
>>> at $Proxy15.getDatabase(Unknown Source)
>>> at 
>>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
>>> at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>> at java.lang.reflect.Method.invoke(Method.java:611)
>>> at 
>>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
>>> at $Proxy17.get_database(Unknown Source)
>>> at 
>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
>>> at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>> at java.lang.reflect.Method.invoke(Method.java:611)
>>> at 
>>> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
>>> at $Proxy18.getDatabase(Unknown Source)
>>> at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
>>> at 
>>> org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
>>> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
>>> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
>>> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
>>> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
>>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
>>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
>>> at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:208)
>>> at 
>>> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:182)
>>> a

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Yin Huai
Hi Jenny,

How's your metastore configured for both Hive and Spark SQL? Which
metastore mode are you using (based on
https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
)?

Thanks,

Yin


On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao  wrote:

>
>
> you can reproduce this issue with the following steps (assuming you have
> Yarn cluster + Hive 12):
>
> 1) using hive shell, create a database, e.g: create database ttt
>
> 2) write a simple spark sql program
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql._
> import org.apache.spark.sql.hive.HiveContext
>
> object HiveSpark {
>   case class Record(key: Int, value: String)
>
>   def main(args: Array[String]) {
> val sparkConf = new SparkConf().setAppName("HiveSpark")
> val sc = new SparkContext(sparkConf)
>
> // A hive context creates an instance of the Hive Metastore in process,
> val hiveContext = new HiveContext(sc)
> import hiveContext._
>
> hql("use ttt")
> hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> hql("LOAD DATA INPATH '/user/biadmin/kv1.txt' INTO TABLE src")
>
> // Queries are expressed in HiveQL
> println("Result of 'SELECT *': ")
> hql("SELECT * FROM src").collect.foreach(println)
> sc.stop()
>   }
> }
> 3) run it in yarn-cluster mode.
>
>
> On Mon, Aug 11, 2014 at 9:44 AM, Cheng Lian  wrote:
>
>> Since you were using hql(...), it’s probably not related to JDBC driver.
>> But I failed to reproduce this issue locally with a single node pseudo
>> distributed YARN cluster. Would you mind to elaborate more about steps to
>> reproduce this bug? Thanks
>> ​
>>
>>
>> On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian 
>> wrote:
>>
>>> Hi Jenny, does this issue only happen when running Spark SQL with YARN
>>> in your environment?
>>>
>>>
>>> On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao 
>>> wrote:
>>>

 Hi,

 I am able to run my hql query on yarn cluster mode when connecting to
 the default hive metastore defined in hive-site.xml.

 however, if I want to switch to a different database, like:

   hql("use other-database")


 it only works in yarn client mode, but failed on yarn-cluster mode with
 the following stack:

 14/08/08 12:09:11 INFO HiveMetaStore: 0: get_database: tt
 14/08/08 12:09:11 INFO audit: ugi=biadmin  ip=unknown-ip-addr  
 cmd=get_database: tt
 14/08/08 12:09:11 ERROR RetryingHMSHandler: 
 NoSuchObjectException(message:There is no database named tt)
at 
 org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
at 
 org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at 
 org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
at $Proxy15.getDatabase(Unknown Source)
at 
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at 
 org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
at $Proxy17.get_database(Unknown Source)
at 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at 
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at $Proxy18.getDatabase(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
at 
 org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
at 
 org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
>

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread Haoyuan Li
Is speculative execution enabled?

Best,

Haoyuan


On Mon, Aug 11, 2014 at 8:08 AM, chutium  wrote:

> sharing /reusing RDDs is always useful for many use cases, is this possible
> via persisting RDD on tachyon?
>
> such as off heap persist a named RDD into a given path (instead of
> /tmp_spark_tachyon/spark-xxx-xxx-xxx)
> or
> saveAsParquetFile on tachyon
>
> i tried to save a SchemaRDD on tachyon,
>
> val parquetFile =
>
> sqlContext.parquetFile("hdfs://test01.zala:8020/user/hive/warehouse/parquet_tables.db/some_table/")
> parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_1")
>
> but always error, first error message is:
>
> 14/08/11 16:19:28 INFO storage.BlockManagerInfo: Added broadcast_6_piece0
> in
> memory on test03.zala:37377 (size: 18.7 KB, free: 16.6 GB)
> 14/08/11 16:20:06 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 3.0
> (TID 35, test04.zala): java.io.IOException:
> FailedToCheckpointException(message:Failed to rename
> hdfs://test01.zala:8020/tmp/tachyon/workers/140776003/31806/730 to
> hdfs://test01.zala:8020/tmp/tachyon/data/730)
> tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:112)
> tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:168)
> tachyon.client.FileOutStream.close(FileOutStream.java:104)
>
>
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
> parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
>
>
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
>
> parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
>
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org
> $apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:259)
>
>
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
>
>
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:722)
>
>
>
> hdfs://test01.zala:8020/tmp/tachyon/
> already chmod to 777, both owner and group is same as spark/tachyon startup
> user
>
> off-heap persist or saveAs normal text file on tachyon works fine.
>
> CDH 5.1.0, spark 1.1.0 snapshot, tachyon 0.6 snapshot
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/share-reuse-off-heap-persisted-tachyon-RDD-in-SparkContext-or-saveAsParquetFile-on-tachyon-in-SQLCont-tp11897.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Jenny Zhao
Thanks Yin!

Here is my hive-site.xml, which I copied from $HIVE_HOME/conf. I didn't
experience any problem connecting to the metastore through Hive, which uses DB2
as the metastore database.





 
<configuration>
 <property>
  <name>hive.hwi.listen.port</name>
  <value></value>
 </property>
 <property>
  <name>hive.querylog.location</name>
  <value>/var/ibm/biginsights/hive/query/${user.name}</value>
 </property>
 <property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/biginsights/hive/warehouse</value>
 </property>
 <property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-0.12.0.war</value>
 </property>
 <property>
  <name>hive.metastore.metrics.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:db2://hdtest022.svl.ibm.com:50001/BIDB</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.ibm.db2.jcc.DB2Driver</value>
 </property>
 <property>
  <name>hive.stats.autogather</name>
  <value>false</value>
 </property>
 <property>
  <name>javax.jdo.mapping.Schema</name>
  <value>HIVE</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>catalog</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>V2pJNWMxbFlVbWhaZHowOQ==</value>
 </property>
 <property>
  <name>hive.metastore.password.encrypt</name>
  <value>true</value>
 </property>
 <property>
  <name>org.jpox.autoCreateSchema</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.server2.thrift.min.worker.threads</name>
  <value>5</value>
 </property>
 <property>
  <name>hive.server2.thrift.max.worker.threads</name>
  <value>100</value>
 </property>
 <property>
  <name>hive.server2.thrift.port</name>
  <value>1</value>
 </property>
 <property>
  <name>hive.server2.thrift.bind.host</name>
  <value>hdtest022.svl.ibm.com</value>
 </property>
 <property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
 </property>
 <property>
  <name>hive.server2.custom.authentication.class</name>
  <value>org.apache.hive.service.auth.WebConsoleAuthenticationProviderImpl</value>
 </property>
 <property>
  <name>hive.server2.enable.impersonation</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.security.webconsole.url</name>
  <value>http://hdtest022.svl.ibm.com:8080</value>
 </property>
 <property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
 </property>
</configuration>




On Mon, Aug 11, 2014 at 4:29 PM, Yin Huai  wrote:

> Hi Jenny,
>
> How's your metastore configured for both Hive and Spark SQL? Which
> metastore mode are you using (based on
> https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
> )?
>
> Thanks,
>
> Yin
>
>
> On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao 
> wrote:
>
>>
>>
>> you can reproduce this issue with the following steps (assuming you have
>> Yarn cluster + Hive 12):
>>
>> 1) using hive shell, create a database, e.g: create database ttt
>>
>> 2) write a simple spark sql program
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>> import org.apache.spark.sql._
>> import org.apache.spark.sql.hive.HiveContext
>>
>> object HiveSpark {
>>   case class Record(key: Int, value: String)
>>
>>   def main(args: Array[String]) {
>> val sparkConf = new SparkConf().setAppName("HiveSpark")
>> val sc = new SparkContext(sparkConf)
>>
>> // A hive context creates an instance of the Hive Metastore in
>> process,
>> val hiveContext = new HiveContext(sc)
>> import hiveContext._
>>
>> hql("use ttt")
>> hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>> hql("LOAD DATA INPATH '/user/biadmin/kv1.txt' INTO TABLE src")
>>
>> // Queries are expressed in HiveQL
>> println("Result of 'SELECT *': ")
>> hql("SELECT * FROM src").collect.foreach(println)
>> sc.stop()
>>   }
>> }
>> 3) run it in yarn-cluster mode.
>>
>>
>> On Mon, Aug 11, 2014 at 9:44 AM, Cheng Lian 
>> wrote:
>>
>>> Since you were using hql(...), it’s probably not related to JDBC
>>> driver. But I failed to reproduce this issue locally with a single node
>>> pseudo distributed YARN cluster. Would you mind to elaborate more about
>>> steps to reproduce this bug? Thanks
>>> ​
>>>
>>>
>>> On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian 
>>> wrote:
>>>
 Hi Jenny, does this issue only happen when running Spark SQL with YARN
 in your environment?


 On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao 
 wrote:

>
> Hi,
>
> I am able to run my hql query on yarn cluster mode when connecting to
> the default hive metastore defined in hive-site.xml.
>
> however, if I want to switch to a different database, like:
>
>   hql("use other-database")
>
>
> it only works in yarn client mode, but failed on yarn-cluster mode
> with the following stack:
>
> 14/08/08 12:09:11 INFO HiveMetaStore: 0: get_database: tt
> 14/08/08 12:09:11 INFO audit: ugi=biadmin ip=unknown-ip-addr  
> cmd=get_database: tt
> 14/08/08 12:09:11 ERROR RetryingHMSHandler: 
> NoSuchObjectException(message:There is no database named tt)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>   at java.lang.reflect.Method.invoke(Method.java:611)
>   at 
> org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
>   at $Proxy15.getDatabase(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
>   at sun.r

Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Tobias Pfeiffer
Hi,

On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers <
gwenhael.pasqui...@ericsson.com> wrote:
>
> We intend to apply other operations on the data later in the same spark
> context, but our first step is to archive it.
>
>
>
> Our goal is somth like this
>
> Step 1 : consume kafka
> Step 2 : archive to hdfs AND send to step 3
> Step 3 : transform data
>
> Step 4 : save transformed data to HDFS as input for M/R
>

I see. Well I think Spark Streaming may be well suited for that purpose.


> To us it looks like a great flaw if, in streaming mode, spark-streaming
> cannot slow down it’s consumption depending on the available resources.
>

On Mon, Aug 11, 2014 at 10:10 PM, Gwenhael Pasquiers <
gwenhael.pasqui...@ericsson.com> wrote:
>
> I think the kind of self-regulating system you describe would be too
> difficult to implement and probably unreliable (even more with the fact
> that we have multiple slaves).
>

Isn't "slow down its consumption depending on the available resources" a
"self-regulating system"? I don't see how you can adapt to available
resources without measuring your execution time and then change how much
you consume. Did you have any particular form of adaption in mind?

Tobias


Spark streaming error - Task not serializable

2014-08-11 Thread Xuri Nagarin
Hi,

I have some quick/dirty code here running in Spark 1.0.0 (CDH 5.1,
Spark in Yarn cluster mode)

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
import kafka.producer._
import java.util.Properties

def toKafka(str: String) {
val props = new Properties()
props.put("metadata.broker.list", "host1:9092,host2:902")
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("producer.type", "async")
props.put("request.required.acks", "1")

val topic = "normstruct"
val config = new ProducerConfig(props)
val producer = new Producer[String, String](config)
val kOutMsg = new KeyedMessage[String,String](topic,str)
producer.send(kOutMsg)
}

val ssc = new StreamingContext(sc,Seconds(5))
val kInMsg = 
KafkaUtils.createStream(ssc,"zkhost1:2181/zk_kafka","normApp",Map("rawunstruct"
-> 1))
kInMsg.foreach(rdd => { rdd.foreach(e => toKafka(e._2)) })
ssc.start()

Throws:
14/08/12 00:47:25 INFO DAGScheduler: Failed to run foreach at :31
14/08/12 00:47:25 ERROR JobScheduler: Error running job streaming job
1407804445000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure:
Task not serializable: java.io.NotSerializableException:
org.apache.spark.streaming.StreamingContext
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/08/12 00:47:25 INFO ShuffleBlockManager: Could not find files for
shuffle 0 for deleting
14/08/12 00:47:25 INFO ContextCleaner: Cleaned shuffle 0


Thanks,

Xuri

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
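
A common workaround for this kind of NotSerializableException (a sketch only, not from this thread; it reuses the imports from the snippet above, and the broker list and topic are placeholders) is to keep the StreamingContext and the Kafka producer out of the serialized closure by building the producer inside foreachPartition, on the executors:

kInMsg.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // Built on the executor, once per partition, so nothing non-serializable
    // is captured from the driver side.
    val props = new Properties()
    props.put("metadata.broker.list", "host1:9092,host2:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    props.put("request.required.acks", "1")
    val producer = new Producer[String, String](new ProducerConfig(props))
    iter.foreach(e => producer.send(new KeyedMessage[String, String]("normstruct", e._2)))
    producer.close()
  }
}

One producer per partition (rather than per record) also keeps the connection overhead down.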



Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Xuri Nagarin
In general, (and I am prototyping), I have a better idea :)
- Consume kafka in Spark from topic-A
- transform data in Spark (normalize, enrich etc etc)
- Feed it back to Kafka (in a different topic-B)
- Have flume->HDFS (for M/R, Impala, Spark batch) or Spark-streaming
or any other compute framework subscribe to B




On Mon, Aug 11, 2014 at 5:57 PM, Tobias Pfeiffer  wrote:
> Hi,
>
> On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers
>  wrote:
>>
>> We intend to apply other operations on the data later in the same spark
>> context, but our first step is to archive it.
>>
>>
>>
>> Our goal is somth like this
>>
>> Step 1 : consume kafka
>> Step 2 : archive to hdfs AND send to step 3
>> Step 3 : transform data
>>
>> Step 4 : save transformed data to HDFS as input for M/R
>
>
> I see. Well I think Spark Streaming may be well suited for that purpose.
>
>>
>> To us it looks like a great flaw if, in streaming mode, spark-streaming
>> cannot slow down it’s consumption depending on the available resources.
>
>
> On Mon, Aug 11, 2014 at 10:10 PM, Gwenhael Pasquiers
>  wrote:
>>
>> I think the kind of self-regulating system you describe would be too
>> difficult to implement and probably unreliable (even more with the fact that
>> we have multiple slaves).
>
>
> Isn't "slow down its consumption depending on the available resources" a
> "self-regulating system"? I don't see how you can adapt to available
> resources without measuring your execution time and then change how much you
> consume. Did you have any particular form of adaption in mind?
>
> Tobias

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: mllib style

2014-08-11 Thread Matei Zaharia
The public API of MLlib is meant to be Java-friendly, so that's why it has 
setters and getters with Java-like names. Internal APIs don't have to be.

Matei

On August 11, 2014 at 12:08:20 PM, Koert Kuipers (ko...@tresata.com) wrote:

i was just looking at ALS 
(mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala)

is there any need for all the variables to be vars and to have all these setters 
around? it just leads to so much clutter

if you really want them to be vars, it is safe in scala to make them public 
(scala creates setters and getters anyhow for you, so you can change a var to 
methods later if you want other variables, without breaking APIs).

but a more scala-friendly style would be vals and no setters...

best, koert
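
For reference, the Java-friendly setter style being discussed looks like this when used from Scala (a small sketch; the parameter values are arbitrary):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// ratings: RDD[Rating] built elsewhere
val als = new ALS()
  .setRank(10)
  .setIterations(20)
  .setLambda(0.01)
val model = als.run(ratings)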


Using very large files for KMeans training -- cluster centers size?

2014-08-11 Thread durin
I'm trying to apply KMeans training to some text data, which consists of
lines that each contain something between 3 and 20 words. For that purpose,
all unique words are saved in a dictionary. This dictionary can become very
large as no hashing etc. is done, but it should spill to disk in case it
doesn't fit into memory anymore:
var dict = scala.collection.mutable.Map[String,Int]()
dict.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)

With the help of this dictionary, I build sparse feature vectors for each
line which are then saved in an RDD that is used as input for KMeans.train.

Spark is running in standalone mode, in this case with 5 worker nodes.
It appears that anything up to the actual training completes successfully
with 126G of training data (logs below).

The training data is provided in the form of a cached, broadcasted variable to all
worker nodes:

var vectors2 =
vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var broadcastVector = sc.broadcast(vectors2)
println("-Start model training-");
var model = KMeans.train(broadcastVector.value, 20, 10)

The first error I get is a null pointer exception, but there is still work
done after that. I think the real reason this terminates is
java.lang.OutOfMemoryError: Java heap space.

Is it possible that this happens because the cluster centers in the model
are represented in dense instead of sparse form, thereby getting large with
a large vector size? If yes, how can I make sure it doesn't crash because of
that? It should spill to disk if necessary.
My goal would be to have the input size only limited by disk space. Sure it
would get very slow if it spills to disk all the time, but it shouldn't
terminate.



Here's the console output from the model.train part:

-Start model training-
14/08/11 17:05:17 INFO spark.SparkContext: Starting job: takeSample at
KMeans.scala:263
14/08/11 17:05:17 INFO scheduler.DAGScheduler: Registering RDD 64
(repartition at :48)
14/08/11 17:05:17 INFO scheduler.DAGScheduler: Got job 6 (takeSample at
KMeans.scala:263) with 1000 output partitions (allowLocal=false)
14/08/11 17:05:17 INFO scheduler.DAGScheduler: Final stage: Stage
8(takeSample at KMeans.scala:263)
14/08/11 17:05:17 INFO scheduler.DAGScheduler: Parents of final stage:
List(Stage 9)
14/08/11 17:05:17 INFO scheduler.DAGScheduler: Missing parents: List(Stage
9)
14/08/11 17:05:17 INFO scheduler.DAGScheduler: Submitting Stage 9
(MapPartitionsRDD[64] at repartition at :48), which has no missing
parents
4116.323: [GC (Allocation Failure) [PSYoungGen: 1867168K->240876K(2461696K)]
4385155K->3164592K(9452544K), 1.4455064 secs] [Times: user=11.33 sys=0.03,
real=1.44 secs]
4174.512: [GC (Allocation Failure) [PSYoungGen: 1679497K->763168K(2338816K)]
4603212K->3691609K(9329664K), 0.8050508 secs] [Times: user=6.04 sys=0.01,
real=0.80 secs]
4188.250: [GC (Allocation Failure) [PSYoungGen: 2071822K->986136K(2383360K)]
5000263K->4487601K(9374208K), 1.6795174 secs] [Times: user=13.23 sys=0.01,
real=1.68 secs]
14/08/11 17:06:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks
from Stage 9 (MapPartitionsRDD[64] at repartition at :48)
14/08/11 17:06:57 INFO scheduler.TaskSchedulerImpl: Adding task set 9.0 with
1 tasks
4190.947: [GC (Allocation Failure) [PSYoungGen: 2336718K->918720K(2276864K)]
5838183K->5406145K(9267712K), 1.5793066 secs] [Times: user=12.40 sys=0.02,
real=1.58 secs]
14/08/11 17:07:00 WARN scheduler.TaskSetManager: Stage 9 contains a task of
very large size (272484 KB). The maximum recommended task size is 100 KB.
14/08/11 17:07:00 INFO scheduler.TaskSetManager: Starting task 0.0 in stage
9.0 (TID 3053, idp11.foo.bar, PROCESS_LOCAL, 279023993 bytes)
4193.607: [GC (Allocation Failure) [PSYoungGen: 2070046K->599908K(2330112K)]
6557472K->5393557K(9320960K), 0.3267949 secs] [Times: user=2.53 sys=0.01,
real=0.33 secs]
4194.645: [GC (Allocation Failure) [PSYoungGen: 1516770K->589655K(2330112K)]
6310419K->5383352K(9320960K), 0.2566507 secs] [Times: user=1.96 sys=0.00,
real=0.26 secs]
4195.815: [GC (Allocation Failure) [PSYoungGen: 1730909K->275312K(2330112K)]
6524606K->5342865K(9320960K), 0.2053884 secs] [Times: user=1.57 sys=0.00,
real=0.21 secs]
14/08/11 17:08:56 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in
memory on idp11.foo.bar:46418 (size: 136.0 B, free: 10.4 GB)
14/08/11 17:08:56 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to sp...@idp11.foo.bar:57072
14/08/11 17:10:09 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0
(TID 3053, idp11.foo.bar): java.lang.NullPointerException:
   
$line86.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:36)
   
$line86.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:36)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.coll
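
Two notes on the setup above, offered as a sketch rather than a definitive fix. First, sc.broadcast is meant for plain local values; wrapping an RDD in it does not ship the RDD's contents, so KMeans.train can simply be given the cached RDD. Second, the dictionary-to-sparse-vector step described in the post might look roughly like this (lines, dict and the split character are assumptions standing in for the code not shown; in practice the dictionary would itself be broadcast):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans

val vocabSize = dict.size
val vectors = lines.map { line =>
  // map each word to its dictionary index, then count occurrences per index
  val counts = line.split(" ").flatMap(dict.get).groupBy(identity).mapValues(_.length.toDouble)
  Vectors.sparse(vocabSize, counts.toSeq)
}.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)

val model = KMeans.train(vectors, 20, 10)

If the cluster centers are indeed kept dense (as the post suspects), each of the 20 centers is a vocabSize-long array, so a very large vocabulary inflates the model and the per-task updates regardless of how sparse the input vectors are.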

Re: Initial job has not accepted any resources

2014-08-11 Thread ldmtwo
I see this error too. I have never found a fix and I've been working on this
for a few months. 

For me, I have 4 nodes with 46GB and 8 cores each. If I change the executor
to use 8GB, it fails. If I use 6GB, it works. I request 2 cores only. On
another cluster, I have different limits.  My workload is extremely memory
intensive and I can't even get the smaller loads to run.

Every "solution" says that we have too few cores or RAM, but they are wrong.
Something is either misleading or not working. I'm using 1.0.1 and 1.0.2.

I have checked all nodes and see plenty of free RAM. The driver/master node
will run and do its data loading and processing, but the executors never
start up, attach, or connect to do the real work.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp11668p11938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.lang.StackOverflowError when calling count()

2014-08-11 Thread randylu
Hi TD. I also fell into the trap of long lineage, and your suggestions do
work well. But I don't understand why a long lineage can cause a stack
overflow, and where it takes effect.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-tp5649p11941.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
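
On the "why": each RDD keeps a reference to its parent, and both the DAG traversal at job submission and the Java serialization of a task's lineage are recursive, so a lineage that is thousands of transformations deep can overflow the stack. The usual remedy, sketched below with illustrative names and paths, is to checkpoint every so often so the chain is cut:

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var rdd = initialRdd
for (i <- 1 to numIterations) {
  rdd = rdd.map(step).cache()
  if (i % 10 == 0) {
    rdd.checkpoint()   // mark the RDD to be written to the checkpoint dir
    rdd.count()        // force materialization so the lineage is actually truncated
  }
}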



Benchmark on physical Spark cluster

2014-08-11 Thread Mozumder, Monir
I am trying to get some workloads or benchmarks to run on a physical Spark 
cluster and to find relative speedups on different physical clusters.

The instructions at 
https://databricks.com/blog/2014/02/12/big-data-benchmark.html use Amazon EC2. 
I was wondering if anyone has other benchmarks for Spark on physical clusters. 
I am hoping to get a CloudSuite-like suite for Spark.

Bests,
-Monir


Re: Initial job has not accepted any resources

2014-08-11 Thread 诺铁
Just as Marcelo Vanzin said, there are two possible reasons for this problem.
I solved reason 2 several days ago.

My process was: ssh to one of the worker nodes, read its log output, and find a
line that says
"Remoting started"
After that line there should be some lines saying "connecting to x".
MAKE SURE the worker node can really connect to the designated host.

My problem was caused by a hostname misconfiguration; after fixing that, the
problem was solved.

The error message complaining about resource allocation is really misleading in
this case.


On Tue, Aug 12, 2014 at 9:51 AM, ldmtwo  wrote:

> I see this error too. I have never found a fix and I've been working on
> this
> for a few months.
>
> For me, I have 4 nodes with 46GB and 8 cores each. If I change the executor
> to use 8GB, if fails. If I use 6GB, it works. I request 2 cores only. On
> another cluster, I have different limits.  My workload is extremely memory
> intensive and I can't even get the smaller loads to run.
>
> Every "solution" says that we have too few cores or RAM, but they are
> wrong.
> Something is either misleading or not working. I'm using 1.0.1 and 1.0.2.
>
> I have checked all nodes and see plenty of free RAM. The driver/master node
> will run and do it's data loading and processing, but the executors never
> start up, attach, connect or w/e to do the real work.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp11668p11938.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-11 Thread Ge, Yao (Y.)
I am trying to train a KMeans model with sparse vector with Spark 1.0.1.
When I run the training I got the following exception:
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:271)
at 
org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:398)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:372)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:366)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:366)
at 
org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:389)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:269)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:268)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:268)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:267)

What does this mean? How do I troubleshoot this problem?
Thanks.

-Yao


Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread ZHENG, Xu-dong
Hi all,

We are trying to use Spark MLlib to train super large data (100M features
and 5B rows). The input data in HDFS has ~26K partitions. By default, MLlib
will create a task for every partition at each iteration. But because our
dimensionality is also very high, such a large number of tasks introduces
large network overhead to transfer the weight vector. So we want to reduce
the number of tasks; we tried the ways below:

1. Coalesce partitions without shuffling, then cache.

data.coalesce(numPartitions).cache()

This works fine for relatively small data, but when the data keeps growing and
numPartitions is fixed, the size of one partition will become large. This
introduces two issues: first, a larger partition needs larger objects and more
memory at runtime, and triggers GC more frequently; second, we hit the
'size exceeds Integer.MAX_VALUE' error, which seems to be caused by the size
of one partition being larger than 2G (
https://issues.apache.org/jira/browse/SPARK-1391).

2. Coalesce partitions with shuffling, then cache.

data.coalesce(numPartitions, true).cache()

It can mitigate the second issue in #1 to some degree, but the first issue is
still there, and it will also introduce a large amount of shuffling.

3. Cache data first, and coalesce partitions.

data.cache().coalesce(numPartitions)

In this way, the number of cached partitions is not changed, but each task
reads data from multiple partitions. However, I find that tasks lose
locality this way. I see a lot of 'ANY' tasks, which means tasks
read data from other nodes and become slower than tasks that read data from
local memory.

I think the best way should be like #3, but leveraging locality as much as
possible. Is there any way to do that? Any suggestions?

Thanks!

-- 
ZHENG, Xu-dong


Transform RDD[List]

2014-08-11 Thread Kevin Jung
Hi
It may be a simple question, but I cannot figure out the most efficient way.
There is an RDD containing lists.

RDD
(
 List(1,2,3,4,5)
 List(6,7,8,9,10)
)

I want to transform this to

RDD
(
List(1,6)
List(2,7)
List(3,8)
List(4,9)
List(5,10)
)

And I want to achieve this without using the collect method, because a
real-world RDD can have a lot of elements and that may cause an out-of-memory error.
Any ideas will be welcome.

Best regards
Kevin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Transform-RDD-List-tp11948.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Transform RDD[List]

2014-08-11 Thread Soumya Simanta
Try something like this.


scala> val a = sc.parallelize(List(1,2,3,4,5))

a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize
at :12


scala> val b = sc.parallelize(List(6,7,8,9,10))

b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize
at :12


scala> val x = a zip b

x: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedRDD[3] at zip at
:16


scala> val f = x.map( x => List(x._1,x._2) )

f: org.apache.spark.rdd.RDD[List[Int]] = MappedRDD[5] at map at :18


scala> f.foreach(println)

List(2, 7)

List(1, 6)

List(5, 10)

List(3, 8)

List(4, 9)


On Tue, Aug 12, 2014 at 12:42 AM, Kevin Jung  wrote:

> Hi
> It may be simple question, but I can not figure out the most efficient way.
> There is a RDD containing list.
>
> RDD
> (
>  List(1,2,3,4,5)
>  List(6,7,8,9,10)
> )
>
> I want to transform this to
>
> RDD
> (
> List(1,6)
> List(2,7)
> List(3,8)
> List(4,9)
> List(5,10)
> )
>
> And I want to achieve this without using collect method because realworld
> RDD can have a lot of elements then it may cause out of memory.
> Any ideas will be welcome.
>
> Best regards
> Kevin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Transform-RDD-List-tp11948.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread Jiusheng Chen
How about increasing the HDFS file extent size? E.g. if the current value is 128M, we
make it 512M or bigger.


On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong  wrote:

> Hi all,
>
> We are trying to use Spark MLlib to train super large data (100M features
> and 5B rows). The input data in HDFS has ~26K partitions. By default, MLlib
> will create a task for every partition at each iteration. But because our
> dimensions are also very high, such large number of tasks will increase
> large network overhead to transfer the weight vector. So we want to reduce
> the number of tasks, we tried below ways:
>
> 1. Coalesce partitions without shuffling, then cache.
>
> data.coalesce(numPartitions).cache()
>
> This works fine for relative small data, but when data is increasing and
> numPartitions is fixed, the size of one partition will be large. This
> introduces two issues: the first is, the larger partition will need larger
> object and more memory at runtime, and trigger GC more frequently; the
> second is, we meet the issue 'size exceeds integer.max_value' error, which
> seems be caused by the size of one partition larger than 2G (
> https://issues.apache.org/jira/browse/SPARK-1391).
>
> 2. Coalesce partitions with shuffling, then cache.
>
> data.coalesce(numPartitions, true).cache()
>
> It could mitigate the second issue in #1 at some degree, but fist issue is
> still there, and it also will introduce large amount of shullfling.
>
> 3. Cache data first, and coalesce partitions.
>
> data.cache().coalesce(numPartitions)
>
> In this way, the number of cached partitions is not change, but each task
> read the data from multiple partitions. However, I find the task will loss
> locality by this way. I find a lot of 'ANY' tasks, that means that tasks
> read data from other nodes, and become slower than that read data from
> local memory.
>
> I think the best way should like #3, but leverage locality as more as
> possible. Is there any way to do that? Any suggestions?
>
> Thanks!
>
> --
> ZHENG, Xu-dong
>
>


Support for ORC Table in Shark/Spark

2014-08-11 Thread vinay . kashyap



Hi all,
Is it possible to use a table in ORC format in Shark
version 0.9.1 with Spark 0.9.2 and Hive version 0.12.0..??
I have tried creating the ORC table in Shark using the below query:

create table orc_table (x int, y string) stored as orc

The create table works, but when I try to insert values
from a text table containing 2 rows:

insert into table orc_table select * from text_table;

I get the below exception:

org.apache.spark.SparkException: Job
aborted: Task 3.0:1 failed 4 times (most recent failure: Exception
failure: org.apache.hadoop.hive.ql.metadata.HiveException:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
Failed to create file
[/tmp/hive-windfarm/hive_2014-08-08_10-11-21_691_1945292644101251597/_task_tmp.-ext-1/_tmp.01_0]
for [DFSClient_attempt_201408081011__m_01_0_-341065575_80] on
client [], because this file is already being created
by [DFSClient_attempt_201408081011__m_01_0_82854889_71] on
[192.168.22.40]

        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)

        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2306)

        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2235)

        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2188)

        at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:505)

        at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)

        at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

        at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

        at
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)

        at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)

        at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)

        at
java.security.AccessController.doPrivileged(Native Method)

        at
javax.security.auth.Subject.doAs(Subject.java:415)

        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)

        at
org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)

)

        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)

        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)

        at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

        at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

        at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)

        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)

        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)

        at
scala.Option.foreach(Option.scala:236)

        at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)

        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)

        at
akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

        at
akka.actor.ActorCell.invoke(ActorCell.scala:456)

        at
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)

        at
akka.dispatch.Mailbox.run(Mailbox.scala:219)

        at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)

        at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

FAILED: Execution Error, return code -101
from shark.execution.SparkTask

 

Any idea how to overcome this..??

 

 

 

Thanks and regards

Vinay Kashyap



How to save mllib model to hdfs and reload it

2014-08-11 Thread XiaoQinyu
hello:

I want to know: if I use historical data to train a model and I want to use
this model in another app, how should I do it?

Should I save this model to disk, and load it from disk when I use it? But I
don't know how to save the MLlib model and reload it.

I would be very pleased if anyone could give some tips.

Thanks

XiaoQinyu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
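
MLlib models in these Spark versions do not ship a save/load helper, so one workaround (a sketch only; the paths are placeholders and cross-version binary compatibility is not guaranteed) is to write the model object itself to HDFS with Java serialization and read it back in the other application:

// In the training app: the trained model (KMeansModel, SVMModel, ...) is a serializable object
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///models/my-model")

// In the consuming app: the type parameter must match whatever was saved
val reloaded = sc.objectFile[org.apache.spark.mllib.classification.SVMModel]("hdfs:///models/my-model").first()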



Re: Mllib : Save SVM model to disk

2014-08-11 Thread XiaoQinyu
Have you solved this problem??

And could you share how to save model to hdfs and reload it?

Thanks

XiaoQinyu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-Save-SVM-model-to-disk-tp74p11954.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread ZHENG, Xu-dong
I think this has the same effect and issue as #1, right?


On Tue, Aug 12, 2014 at 1:08 PM, Jiusheng Chen 
wrote:

> How about increase HDFS file extent size? like current value is 128M, we
> make it 512M or bigger.
>
>
> On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong 
> wrote:
>
>> Hi all,
>>
>> We are trying to use Spark MLlib to train super large data (100M features
>> and 5B rows). The input data in HDFS has ~26K partitions. By default, MLlib
>> will create a task for every partition at each iteration. But because our
>> dimensions are also very high, such large number of tasks will increase
>> large network overhead to transfer the weight vector. So we want to reduce
>> the number of tasks, we tried below ways:
>>
>> 1. Coalesce partitions without shuffling, then cache.
>>
>> data.coalesce(numPartitions).cache()
>>
>> This works fine for relative small data, but when data is increasing and
>> numPartitions is fixed, the size of one partition will be large. This
>> introduces two issues: the first is, the larger partition will need larger
>> object and more memory at runtime, and trigger GC more frequently; the
>> second is, we meet the issue 'size exceeds integer.max_value' error, which
>> seems be caused by the size of one partition larger than 2G (
>> https://issues.apache.org/jira/browse/SPARK-1391).
>>
>> 2. Coalesce partitions with shuffling, then cache.
>>
>> data.coalesce(numPartitions, true).cache()
>>
>> It could mitigate the second issue in #1 at some degree, but fist issue
>> is still there, and it also will introduce large amount of shullfling.
>>
>> 3. Cache data first, and coalesce partitions.
>>
>> data.cache().coalesce(numPartitions)
>>
>> In this way, the number of cached partitions is not change, but each task
>> read the data from multiple partitions. However, I find the task will loss
>> locality by this way. I find a lot of 'ANY' tasks, that means that tasks
>> read data from other nodes, and become slower than that read data from
>> local memory.
>>
>> I think the best way should like #3, but leverage locality as more as
>> possible. Is there any way to do that? Any suggestions?
>>
>> Thanks!
>>
>> --
>> ZHENG, Xu-dong
>>
>>
>


-- 
郑旭东
ZHENG, Xu-dong


Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
Not sure which stalled HDFS client issue you're referring to, but there was
one fixed in Spark 1.0.2 that could help you out --
https://github.com/apache/spark/pull/1409.  I've still seen one related to
Configuration objects not being threadsafe though so you'd still need to
keep speculation on to fix that (SPARK-2546)

As it stands now, I can:

A) have speculation off, in which case I get random hangs for a variety of
reasons (your HDFS stall, my Configuration safety issue)

or

B) have speculation on, in which case I get random failures related to
LeaseExpiredExceptions and .../_temporary/... file doesn't exist exceptions.


Kind of a catch-22 -- there's no reliable way to run large jobs on Spark
right now!

I'm going to file a bug for the _temporary and LeaseExpiredExceptions as I
think these are widespread enough that we need a place to track a
resolution.


On Mon, Aug 11, 2014 at 9:08 AM, Chen Song  wrote:

> Andrew that is a good finding.
>
> Yes, I have speculative execution turned on, because I saw tasks stalled on
> HDFS client.
>
> If I turned off speculative execution, is there a way to circumvent the
> hanging task issue?
>
>
>
> On Mon, Aug 11, 2014 at 11:13 AM, Andrew Ash  wrote:
>
>> I've also been seeing similar stacktraces on Spark core (not streaming)
>> and have a theory it's related to spark.speculation being turned on.  Do
>> you have that enabled by chance?
>>
>>
>> On Mon, Aug 11, 2014 at 8:10 AM, Chen Song 
>> wrote:
>>
>>> Bill
>>>
>>> Did you get this resolved somehow? Anyone has any insight into this
>>> problem?
>>>
>>> Chen
>>>
>>>
>>> On Mon, Aug 11, 2014 at 10:30 AM, Chen Song 
>>> wrote:
>>>
 The exception was thrown out in application master(spark streaming
 driver) and the job shut down after this exception.


 On Mon, Aug 11, 2014 at 10:29 AM, Chen Song 
 wrote:

> I got the same exception after the streaming job runs for a while, The
> ERROR message was complaining about a temp file not being found in the
> output folder.
>
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job
> 140774430 ms.0
> java.io.FileNotFoundException: File
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
> does not exist.
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> On Fri, Jul 25, 2014 at 7:04 PM, Bill Jay 
> wrote:
>
>> I just saw another error 

Re: Transform RDD[List]

2014-08-11 Thread Kevin Jung
Hi ssimanta.
The first line creates RDD[Int], not RDD[List[Int]].
In the case of List, I cannot zip all list elements in the RDD like a.zip(b), and I
cannot use only a Tuple2, because the real-world RDD has more List elements in the
source RDD.
So I guess the expected result depends on the count of the original Lists.
This problem is related to pivot tables.

Thanks
Kevin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Transform-RDD-List-tp11948p11957.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
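
For the general pivot case with an arbitrary number of lists, one collect-free sketch is to tag every value with its (column, row) position and group by column; the assumption is that a single output list fits in memory on one executor (zipWithIndex on RDDs is available from Spark 1.0):

val rows = sc.parallelize(Seq(List(1, 2, 3, 4, 5), List(6, 7, 8, 9, 10)))

val columns = rows
  .zipWithIndex()                                       // (row, rowIndex)
  .flatMap { case (row, rowIdx) =>
    row.zipWithIndex.map { case (v, colIdx) => (colIdx, (rowIdx, v)) }
  }
  .groupByKey()                                         // all cells of one output list
  .map { case (colIdx, cells) => cells.toList.sortBy(_._1).map(_._2) }

columns.foreach(println)   // List(1, 6), List(2, 7), ... (order across lists not guaranteed)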



Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
Hi Chen,

Please see the bug I filed at
https://issues.apache.org/jira/browse/SPARK-2984 with the
FileNotFoundException on _temporary directory issue.

Andrew


On Mon, Aug 11, 2014 at 10:50 PM, Andrew Ash  wrote:

> Not sure which stalled HDFS client issue your'e referring to, but there
> was one fixed in Spark 1.0.2 that could help you out --
> https://github.com/apache/spark/pull/1409.  I've still seen one related
> to Configuration objects not being threadsafe though so you'd still need to
> keep speculation on to fix that (SPARK-2546)
>
> As it stands now, I can:
>
> A) have speculation off, in which case I get random hangs for a variety of
> reasons (your HDFS stall, my Configuration safety issue)
>
> or
>
> B) have speculation on, in which case I get random failures related to
> LeaseExpiredExceptions and .../_temporary/... file doesn't exist exceptions.
>
>
> Kind of a catch-22 -- there's no reliable way to run large jobs on Spark
> right now!
>
> I'm going to file a bug for the _temporary and LeaseExpiredExceptions as I
> think these are widespread enough that we need a place to track a
> resolution.
>
>
> On Mon, Aug 11, 2014 at 9:08 AM, Chen Song  wrote:
>
>> Andrew that is a good finding.
>>
>> Yes, I have speculative execution turned on, becauseI saw tasks stalled
>> on HDFS client.
>>
>> If I turned off speculative execution, is there a way to circumvent the
>> hanging task issue?
>>
>>
>>
>> On Mon, Aug 11, 2014 at 11:13 AM, Andrew Ash 
>> wrote:
>>
>>> I've also been seeing similar stacktraces on Spark core (not streaming)
>>> and have a theory it's related to spark.speculation being turned on.  Do
>>> you have that enabled by chance?
>>>
>>>
>>> On Mon, Aug 11, 2014 at 8:10 AM, Chen Song 
>>> wrote:
>>>
 Bill

 Did you get this resolved somehow? Anyone has any insight into this
 problem?

 Chen


 On Mon, Aug 11, 2014 at 10:30 AM, Chen Song 
 wrote:

> The exception was thrown out in application master(spark streaming
> driver) and the job shut down after this exception.
>
>
> On Mon, Aug 11, 2014 at 10:29 AM, Chen Song 
> wrote:
>
>> I got the same exception after the streaming job runs for a while,
>> The ERROR message was complaining about a temp file not being found in 
>> the
>> output folder.
>>
>> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job
>> 140774430 ms.0
>> java.io.FileNotFoundException: File
>> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>> does not exist.
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
>> at
>> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
>> at
>> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
>> at
>> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
>> at
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
>> at
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
>> at
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
>> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>> at scala.util.Try$.apply(Try.scala:161)
>> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
>> at
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
>> 

Re: ClassNotFound exception on class in uber.jar

2014-08-11 Thread Akhil Das
This is how I used to do it:

// Create a list of jars
List<String> jars = Lists.newArrayList(
    "/home/akhld/mobi/localcluster/x/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar",
    "ADD-All-The-Jars-Here");

// Create a SparkConf
SparkConf spconf = new SparkConf();
spconf.setMaster("local");
spconf.setAppName("YourApp");
spconf.setSparkHome("/home/akhld/mobi/localcluster/x/spark-0.9.1-bin-hadoop2");
spconf.setJars(jars.toArray(new String[jars.size()]));
spconf.set("spark.executor.memory", "1g");

// Now create the context.
JavaStreamingContext jsc = new JavaStreamingContext(spconf, new Duration(1));

Thanks
Best Regards


On Mon, Aug 11, 2014 at 9:36 PM, lbustelo  wrote:

> Not sure if this problem reached the Spark guys because it shows in Nabble
> that "This post has NOT been accepted by the mailing list yet".
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFound-for-user-class-in-uber-jar-td10613.html#a11902
>
> I'm resubmitting.
>
> Greetings,
>
> I'm currently building a "fat" or "uber" jar with dependencies using maven
> that a docker-ized Spark cluster (1 master 3 workers, version 1.0.0, scala
> 2.10.4) points to locally on the same VM. It seems that sometimes a
> particular class is found, and things are fine, and other times it is not.
> Doing a find on the jar affirms that it is actually there. I `setJars` with
> JavaStreamingContext.jarOfClass(the main class).
>
> I cannot say I know much about how the ClassPath mechanisms of Spark work, so
> I appreciate any and all suggestions to find out what exactly is happening.
>
> The exception is as follows:
>
> 14/07/24 18:48:52 INFO Executor: Sending result for 139 directly to driver
> 14/07/24 18:48:52 INFO Executor: Finished task ID 139
> 14/07/24 18:48:56 WARN BlockManager: Putting block input-0-1406227713800
> failed
> 14/07/24 18:48:56 ERROR BlockManagerWorker: Exception handling buffer
> message
> java.lang.ClassNotFoundException: com.cjm5325.MyProject.MyClass
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at
>
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60)
> at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
>
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> at
>
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:59)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:666)
> at
> org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:587)
> at
>
> org.apache.spark.storage.BlockManagerWorker.putBlock(BlockManagerWorker.scala:82)
> at
>
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:63)
> at
>
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at
>
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at
>
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
> at
> scala.collection.Travers

Re: KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-11 Thread Sean Owen
It sounds like your data does not all have the same dimension? That's
a decent guess. Have a look at the assertions in this method.

On Tue, Aug 12, 2014 at 4:44 AM, Ge, Yao (Y.)  wrote:
> I am trying to train a KMeans model with sparse vector with Spark 1.0.1.
>
> When I run the training I got the following exception:
>
> java.lang.IllegalArgumentException: requirement failed
>
> at scala.Predef$.require(Predef.scala:221)
> at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:271)
> at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:398)
> at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:372)
> at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:366)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:366)
> at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:389)
> at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:269)
> at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:268)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.Range.foreach(Range.scala:141)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:268)
> at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:267)
>
>
>
> What does this mean? How do I troubleshoot this problem?
>
> Thanks.
>
>
>
> -Yao

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: set SPARK_LOCAL_DIRS issue

2014-08-11 Thread Andrew Ash
// assuming Spark 1.0

Hi Baoqiang,

In my experience, on a standalone cluster you need to set
SPARK_WORKER_DIR, not SPARK_LOCAL_DIRS, to control where shuffle files are
written. I think this is a documentation issue that could be improved:
http://spark.apache.org/docs/latest/spark-standalone.html suggests using
SPARK_LOCAL_DIRS for scratch space, and I'm not sure that it actually does
anything.

Did you see anything in /mnt/data/tmp when you used SPARK_LOCAL_DIRS?

Cheers!
Andrew
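
For reference, a minimal sketch of that setting in conf/spark-env.sh on each
worker node (the path here is only an example; workers typically need to be
restarted to pick it up):

# conf/spark-env.sh on every worker node (example path; adjust to your layout)
export SPARK_WORKER_DIR=/mnt/data/tmp   # scratch space (app dirs, shuffle files) for standalone workers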


On Sat, Aug 9, 2014 at 7:21 AM, Baoqiang Cao  wrote:

> Hi
>
> I’m trying to use a specific dir for the Spark working directory since I
> have limited space at /tmp. I tried:
> 1)
> export SPARK_LOCAL_DIRS=“/mnt/data/tmp”
> or 2)
> SPARK_LOCAL_DIRS=“/mnt/data/tmp” in spark-env.sh
>
> But neither worked, since the output of Spark still says
>
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial
> writes to file /tmp/spark-local-20140809134509-0502/34/shuffle_0_436_1
> java.io.FileNotFoundException:
> /tmp/spark-local-20140809134509-0502/34/shuffle_0_436_1 (No space left on
> device)
>
> Can anybody help with correctly setting up the “tmp” directory?
>
> Best,
> Baoqiang Cao
> Blog: http://baoqiang.org
> Email: bqcaom...@gmail.com
>
>
>
>
>


Serialization with com.twitter.chill.MeatLocker

2014-08-11 Thread jerryye
Hi,
I've been trying to use com.twitter.chill.MeatLocker to serialize a
third-party class. So far I'm having no luck and I'm still getting the
dreaded Task not Serializable error for org.ahocorasick.trie.Trie. Am I
doing something obviously wrong? 

Below is my test code that is failing:

import com.twitter.chill.MeatLocker
import org.ahocorasick.trie.Trie

def genMapper[A, B](f: A => B): A => B = {
  val locker = com.twitter.chill.MeatLocker(f)
  x => locker.get.apply(x)
}

val myTrie = new Trie().onlyWholeWords()

myTrie.addKeyword("foo")
myTrie.addKeyword("bar")

val samples = sc.parallelize(Array("foo word", "bar word", "baz word"))

samples.map(t => myTrie.parseText(t)).foreach(println)

And the error:
scala> samples.map(t => myTrie.parseText(t)).foreach(println)
14/08/11 23:53:40 INFO SparkContext: Starting job: foreach at :19
14/08/11 23:53:40 INFO DAGScheduler: Got job 0 (foreach at :19) with 8 output partitions (allowLocal=false)
14/08/11 23:53:40 INFO DAGScheduler: Final stage: Stage 0(foreach at :19)
14/08/11 23:53:40 INFO DAGScheduler: Parents of final stage: List()
14/08/11 23:53:40 INFO DAGScheduler: Missing parents: List()
14/08/11 23:53:40 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at :19), which has no missing parents
14/08/11 23:53:40 INFO DAGScheduler: Failed to run foreach at :19
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.ahocorasick.trie.Trie
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:772)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:715)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:699)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1203)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
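
For comparison, a minimal sketch of the pattern MeatLocker is usually applied
with: wrap the non-serializable object itself and dereference it with .get
inside the closure, so the task closure captures the Kryo-backed wrapper
rather than the raw Trie. (Note that the map closure above references myTrie
directly, not anything wrapped by genMapper.) This is illustrative only, not
a verified fix; whether Kryo can actually round-trip org.ahocorasick.trie.Trie,
and how the REPL captures enclosing vals, still has to be checked:

import com.twitter.chill.MeatLocker
import org.ahocorasick.trie.Trie

// Build the Trie first, then wrap the instance itself in MeatLocker so the
// closure only captures the wrapper, which serializes its contents via Kryo.
val trie = new Trie().onlyWholeWords()
trie.addKeyword("foo")
trie.addKeyword("bar")
val trieLocker = MeatLocker(trie)

val samples = sc.parallelize(Array("foo word", "bar word", "baz word"))

// Dereference with .get inside the closure; the Trie is rehydrated on executors.
samples.map(t => trieLocker.get.parseText(t)).foreach(println)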




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Serialization-with-com-twitter-chill-MeatLocker-tp11965.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org