Create an external table with DataFrameWriterV2

2023-09-19 Thread Christophe Préaud

Hi,

I usually create an external Delta table with the command below, using the
DataFrameWriter API:


df.write
   .format("delta")
   .option("path", "<path>")
   .saveAsTable("<table_name>")

Now I would like to use the DataFrameWriterV2 API.
I have tried the following command:

df.writeTo("<table_name>")
   .using("delta")
   .option("path", "<path>")
   .createOrReplace()

but it creates a managed table, not an external one.

Can you tell me the correct syntax for creating an external table with the
DataFrameWriterV2 API?


Thanks,
Christophe.


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-06 Thread Christophe Préaud
Hi Marc,

I'm not very familiar with Spark on Java, but according to the doc,
it should be:
Encoder<String> stringEncoder = Encoders.STRING();
dataset.as(stringEncoder);


For the record, it is much simpler in Scala:
dataset.as[String]


Of course, this will work if your DataFrame only contains one column of type 
String, e.g.:
val df = spark.read.parquet("Cyrano_de_Bergerac_Acte_V.parquet")
df.printSchema

root
 |-- line: string (nullable = true)

df.as[String]


Otherwise, you will have to somehow convert the Row to a String, e.g. in Scala:
case class Data(f1: String, f2: Int, f3: Long)
val df = Seq(Data("a", 1, 1L), Data("b", 2, 2L), Data("c", 3, 3L), Data("d", 4, 
4L), Data("e", 5, 5L)).toDF
val ds = df.map(_.mkString(",")).as[String]
ds.show

+-----+
|value|
+-----+
|a,1,1|
|b,2,2|
|c,3,3|
|d,4,4|
|e,5,5|
+-----+
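
Another option, if you prefer not to hard-code the field access, is to build the
comma-separated string with the SQL functions API; this is just a sketch, assuming
the usual spark.implicits._ import is in scope:

import org.apache.spark.sql.functions.{col, concat_ws}

// Sketch: concatenate all columns into one comma-separated string column,
// then switch to a typed Dataset[String].
val ds2 = df.select(concat_ws(",", df.columns.map(col): _*)).as[String]
ds2.show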


Regards,
Christophe.

On 6/4/22 14:38, marc nicole wrote:
> Hi,
> How to convert a Dataset<Row> to a Dataset<String>?
> What i have tried is:
>
> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING()); // But this line 
> raises a org.apache.spark.sql.AnalysisException: Try to map struct... to 
> Tuple1, but failed as the number of fields does not line up 
>
> Type of columns being String
> How to solve this?



Re: spark ETL and spark thrift server running together

2022-03-30 Thread Christophe Préaud
Hi Alex,

As stated in the Hive documentation 
(https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration):

*An embedded metastore database is mainly used for unit tests. Only one process 
can connect to the metastore database at a time, so it is not really a 
practical solution but works well for unit tests.*


You need to set up a remote metastore database (e.g. MariaDB / MySQL) for 
production use.
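
For what it's worth, here is a minimal sketch of what the Spark side could look
like once a remote metastore is in place (host, database name and credentials
below are made-up placeholders, not values from this thread):

import org.apache.spark.sql.SparkSession

// Sketch: point Spark's embedded Hive client at a shared MySQL/MariaDB metastore
// instead of the local Derby metastore_db, so several processes can connect at once.
val spark = SparkSession.builder()
  .appName("etl-with-shared-metastore")
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
    "jdbc:mysql://metastore-host:3306/hive_metastore")
  .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.cj.jdbc.Driver")
  .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
  .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "secret")
  .enableHiveSupport()
  .getOrCreate()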

Regards,
Christophe.

On 3/30/22 13:31, Alex Kosberg wrote:
>
> Hi,
>
> Some details:
>
> · Spark SQL (version 3.2.1)
>
> · Driver: Hive JDBC (version 2.3.9)
>
> · ThriftCLIService: Starting ThriftBinaryCLIService on port 1 
> with 5...500 worker threads
>
> · BI tool is connected via ODBC driver
>
> After activating Spark Thrift Server I'm unable to run pyspark script using 
> spark-submit as they both use the same metastore_db
>
> error:
>
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3acaa384, see 
> the next exception for details.
>
>     at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>
>     at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>
>     ... 140 more
>
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /tmp/metastore_db.
>
>  
>
> I need to be able to run PySpark (Spark ETL) while having spark thrift server 
> up for BI tool queries. Any workaround for it?
>
> Thanks!
>
>  
>
>
> Notice: This e-mail together with any attachments may contain information of 
> Ribbon Communications Inc. and its Affiliates that is confidential and/or 
> proprietary for the sole use of the intended recipient. Any review, 
> disclosure, reliance or distribution by others or forwarding without express 
> permission is strictly prohibited. If you are not the intended recipient, 
> please notify the sender immediately and then delete all copies, including 
> any attachments.



Re: How to convert RDD to DF for this case -

2017-02-17 Thread Christophe Préaud
Hi Aakash,

You can try this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val header = Array("col1", "col2", "col3", "col4")
val schema = StructType(header.map(StructField(_, StringType, true)))

val statRow = stat.map(line => Row(line.split("\t"):_*))
val df = spark.createDataFrame(statRow, schema)

df.show
+------+------+----+----+
|  col1|  col2|col3|col4|
+------+------+----+----+
| uihgf| Paris|  56|   5|
|asfsds|   ***|  43|   1|
|fkwsdf|London|  45|   6|
|  gddg|  ABCD|  32|   2|
| grgzg|  *CSD|  35|   3|
| gsrsn|  ADR*|  22|   4|
+------+------+----+----+
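
As a possible follow-up (just a sketch, since the schema above reads every field
as StringType): the numeric-looking columns can be cast afterwards if you need
typed values.

// Sketch: cast the numeric columns after the string-only load.
val typed = df
  .withColumn("col3", df("col3").cast("int"))
  .withColumn("col4", df("col4").cast("int"))
typed.printSchema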


Please let me know if this works for you.

Regards,
Christophe.

On 17/02/17 10:37, Aakash Basu wrote:
Hi all,

Without using case class I tried making a DF to work on the join and other 
filtration later. But I'm getting an ArrayIndexOutOfBoundsException error while 
doing a show of the DF.

1)  Importing SQLContext=
import org.apache.spark.sql.SQLContext._
import org.apache.spark.sql.SQLContext

2)  Initializing SQLContext=
val sqlContext = new SQLContext(sc)

3)  Importing implicits package for toDF conversion=
import sqlContext.implicits._

4)  Reading the Station and Storm Files=
val stat = sc.textFile("/user/root/spark_demo/scala/data/Stations.txt")
val stor = sc.textFile("/user/root/spark_demo/scala/data/Storms.txt")



stat.foreach(println)

uihgf   Paris   56   5
asfsds   ***   43   1
fkwsdf   London   45   6
gddg   ABCD   32   2
grgzg   *CSD   35   3
gsrsn   ADR*   22   4


5) Creating row by segregating columns after reading the tab delimited file 
before converting into DF=

val stati = stat.map(x => (x.split("\t")(0), x.split("\t")(1), 
x.split("\t")(2),x.split("\t")(3)))


6)  Converting into DF=
val station = stati.toDF()

station.show is giving the below error ->

17/02/17 08:46:35 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 1


Please help!

Thanks,
Aakash.



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Partition n keys into exacly n partitions

2016-09-13 Thread Christophe Préaud
Hi,

A custom partitioner is indeed the solution.

Here is a sample code:
import org.apache.spark.Partitioner

class KeyPartitioner(keyList: Seq[Any]) extends Partitioner {

  def numPartitions: Int = keyList.size + 1

  def getPartition(key: Any): Int = keyList.indexOf(key) + 1

  override def equals(other: Any): Boolean = other match {
case h: KeyPartitioner =>
  h.numPartitions == numPartitions
case _ =>
  false
  }

  override def hashCode: Int = numPartitions
}


It allows you to repartition an RDD[(K, V)] so that all lines with the same
key value (and only those lines) end up in the same partition.

You need to pass to the constructor a Seq[K] keyList containing
all the possible key values of the RDD[(K, V)], e.g.:
val rdd = sc.parallelize(
  Seq((1,'a),(2,'a),(3,'a),(1,'b),(2,'b),(1,'c),(3,'c),(4,'d))
)
rdd.partitionBy(new KeyPartitioner(Seq(1,2,3,4)))

will put:
- (1,'a) (1,'b) and (1,'c) in partition 1
- (2,'a) and (2,'b) in partition 2
- (3,'a) and (3,'c) in partition 3
- (4,'d) in partition 4
and nothing in partition 0

If a key is not defined in the keyList, it will be put in partition 0:
rdd.partitionBy(new KeyPartitioner(Seq(1,2,3)))
will put:
- (1,'a) (1,'b) and (1,'c) in partition 1
- (2,'a) and (2,'b) in partition 2
- (3,'a) and (3,'c) in partition 3
- (4,'d) in partition 0
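
To double-check the placement, here is a small sketch that lists which keys ended
up in which partition (nothing specific to this thread, just standard RDD methods):

// Sketch: print, for each partition index, the set of keys it contains.
rdd.partitionBy(new KeyPartitioner(Seq(1, 2, 3, 4)))
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.map(_._1).toSet)) }
  .collect()
  .foreach { case (idx, keys) => println(s"partition $idx -> keys $keys") }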


Please let me know if it fits your needs.

Regards,
Christophe.

On 12/09/16 19:03, Denis Bolshakov wrote:
Just provide your own partitioner.

Once I wrote a partitioner which keeps similar keys together in one partition.

Best regards,
Denis

On 12 September 2016 at 19:44, sujeet jog wrote:
Hi,

Is there a way to partition set of data with n keys into exactly n partitions.

For ex : -

tuple of 1008 rows with key as x
tuple of 1008 rows with key as y   and so on  total 10 keys ( x, y etc )

Total records = 10080
NumOfKeys = 10

i want to partition the 10080 elements into exactly 10 partitions with each 
partition having elements with unique key

Is there a way to make this happen ?.. any ideas on implementing custom 
partitioner.


The current partitioner I'm using is HashPartitioner, with which there are cases 
where key.hashCode() % numPartitions becomes the same for keys x & y,

 hence many elements with different keys fall into a single partition at times.



Thanks,
Sujeet



--
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.de...@gmail.com



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Inode for STS

2016-07-13 Thread Christophe Préaud
Hi Ayan,

I have opened a JIRA about this issue, but there is no answer so far:
SPARK-15401

Regards,
Christophe.

On 13/07/16 05:54, ayan guha wrote:
Hi

We are running Spark Thrift Server as a long running application. However,  it 
looks like it is filling up /tmp/hive folder with lots of small files and 
directories with no file in them, blowing out inode limit and preventing any 
connection with "No Space Left in Device" issue.

What is the best way to clean up those small files periodically?

--
Best Regards,
Ayan Guha



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: SparkSQL with large result size

2016-05-10 Thread Christophe Préaud
Hi,

You may be hitting this bug: 
SPARK-9879

In other words: did you try without the LIMIT clause?
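
And if the goal is simply to persist the full sorted result rather than bring it
back to the driver, something along these lines avoids collecting anything (a
sketch only; the output path is a made-up placeholder):

// Sketch: write the sorted result straight to storage instead of collecting it
// on the driver.
sqlContext.table("t1")
  .orderBy("c1")
  .write
  .mode("overwrite")
  .parquet("hdfs:///tmp/t1_sorted")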

Regards,
Christophe.

On 02/05/16 20:02, Gourav Sengupta wrote:
Hi,

I have worked on 300GB data by querying it  from CSV (using SPARK CSV)  and 
writing it to Parquet format and then querying parquet format to query it and 
partition the data and write out individual csv files without any issues on a 
single node SPARK cluster installation.

Are you trying to cache in the entire data? What is it that you are trying to 
achieve in your use case?

Regards,
Gourav

On Mon, May 2, 2016 at 5:59 PM, Ted Yu wrote:
That's my interpretation.

On Mon, May 2, 2016 at 9:45 AM, Buntu Dev 
<buntu...@gmail.com> 
wrote:
Thanks Ted, I thought the avg. block size was already low and less than the 
usual 128mb. If I need to reduce it further via parquet.block.size, it would 
mean an increase in the number of blocks and that should increase the number of 
tasks/executors. Is that the correct way to interpret this?

On Mon, May 2, 2016 at 6:21 AM, Ted Yu 
<yuzhih...@gmail.com> 
wrote:
Please consider decreasing block size.

Thanks

> On May 1, 2016, at 9:19 PM, Buntu Dev 
> <buntu...@gmail.com> 
> wrote:
>
> I got a 10g limitation on the executors and operating on parquet dataset with 
> block size 70M with 200 blocks. I keep hitting the memory limits when doing a 
> 'select * from t1 order by c1 limit 100' (ie, 1M). It works if I limit to 
> say 100k. What are the options to save a large dataset without running into 
> memory issues?
>
> Thanks!






Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Christophe Préaud
Hi,

Unless I've misunderstood what you want to achieve, you could use:
sqlContext.read.json(sc.textFile("/mnt/views-p/base/2016/01/*/*-xyz.json"))

Regards,
Christophe.

On 09/03/16 15:24, Ted Yu wrote:
Hadoop glob pattern doesn't support multi level wildcard.

Thanks

On Mar 9, 2016, at 6:15 AM, Koert Kuipers 
<ko...@tresata.com> wrote:

if its based on HadoopFsRelation shouldn't it support it? HadoopFsRelation 
handles globs

On Wed, Mar 9, 2016 at 8:56 AM, Ted Yu wrote:
This is currently not supported.

On Mar 9, 2016, at 4:38 AM, Jakub Liska 
<liska.ja...@gmail.com>
 wrote:

Hey,

is something like this possible?

sqlContext.read.json("/mnt/views-p/base/2016/01/*/*-xyz.json")

I switched to DataFrames because my source files changed from TSV to JSON
but now I'm not able to load the files as I did before. I get this error if I 
try that :

https://github.com/apache/spark/pull/9142#issuecomment-194248531




Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: local directories for spark running on yarn

2015-04-30 Thread Christophe Préaud
No, you should read:
if spark.local.dir is specified, spark.local.dir will be ignored.

This has been reworded (hopefully for the best) in 1.3.1: 
https://spark.apache.org/docs/1.3.1/running-on-yarn.html

Christophe.

On 17/04/2015 18:18, shenyanls wrote:
 According to the documentation:

 The local directories used by Spark executors will be the local directories
 configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the
 user specifies spark.local.dir, it will be ignored.
 (https://spark.apache.org/docs/1.2.1/running-on-yarn.html)

 If spark.local.dir is specified, the yarn local directory will be ignored,
 right? It's a little ambiguous to me.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/local-directories-for-spark-running-on-yarn-tp22543.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-27 Thread Christophe Préaud
Yes, spark.yarn.historyServer.address is used to access the spark history 
server from yarn; it is not needed if you use only the yarn history server.
It may be possible to have both history servers running, but I have not tried 
that yet.

Besides, as far as I have understood, yarn and spark history servers have two 
different purposes:
- yarn history server is for looking at your application logs after it has 
finished
- spark history server is for looking at your application in the spark web ui 
(the one with the Stages, Storage, Environment and Executors) after it 
has finished
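
For reference, a minimal sketch of the Spark-side settings usually involved (the
host name and HDFS path below are placeholders, not values from this thread):

import org.apache.spark.SparkConf

// Sketch: enable event logging so finished applications show up in the Spark
// history server, and tell YARN where that server lives. The history server
// daemon itself is configured separately to read the same log directory.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-history")
  .set("spark.yarn.historyServer.address", "historyserver-host:18080")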

Regards,
Christophe.

On 26/02/2015 20:30, Colin Kincaid Williams wrote:
 Right now I have set spark.yarn.historyServer.address in my spark configs to 
have yarn point to the spark-history server. Then from your mail it sounds like 
I should try another setting, or remove it completely. I also noticed that the 
aggregated log files appear in a directory in hdfs under application/spark vs. 
application/yarn or similar. I will review my configurations and see if I can 
get this working.

Thanks,

Colin Williams


On Thu, Feb 26, 2015 at 9:11 AM, Christophe Préaud 
christophe.pre...@kelkoo.com wrote:
You can see this information in the yarn web UI using the configuration I 
provided in my former mail (click on the application id, then on logs; you will 
then be automatically redirected to the yarn history server UI).


On 24/02/2015 19:49, Colin Kincaid Williams wrote:
So back to my original question.

I can see the spark logs using the example above:

yarn logs -applicationId application_1424740955620_0009

This shows yarn log aggregation working. I can see the std out and std error in 
that container information above. Then how can I get this information in a 
web-ui ? Is this not currently supported?

On Tue, Feb 24, 2015 at 10:44 AM, Imran Rashid 
iras...@cloudera.com wrote:
the spark history server and the yarn history server are totally independent.  
Spark knows nothing about yarn logs, and vice versa, so unfortunately there 
isn't any way to get all the info in one place.

On Tue, Feb 24, 2015 at 12:36 PM, Colin Kincaid Williams 
disc...@uw.edu wrote:
Looks like in my tired state, I didn't mention spark the whole time. However, 
it might be implied by the application log above. Spark log aggregation appears 
to be working, since I can run the yarn command above. I do have yarn logging 
setup for the yarn history server. I was trying to use the spark 
history-server, but maybe I should try setting

spark.yarn.historyServer.address

to the yarn history-server, instead of the spark history-server? I tried this 
configuration when I started, but didn't have much luck.

Are you getting your spark apps run in yarn client or cluster mode in your yarn 
history server? If so can you share any spark settings?

On Tue, Feb 24, 2015 at 8:48 AM, Christophe Préaud 
christophe.pre...@kelkoo.com wrote:
Hi Colin,

Here is how I have configured my hadoop cluster to have yarn logs available 
through both the yarn CLI and the _yarn_ history server (with gzip compression 
and 10 days retention):

1. Add the following properties in the yarn-site.xml on each node managers and 
on the resource manager:
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>864000</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://dc1-kdp-dev-hadoop-03.dev.dc1.kelkoo.net:19888/jobhistory/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <value>gz</value>
  </property>

2. Restart yarn and then start the yarn history server on the server defined in 
the yarn.log.server.url property above:

/opt/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver # should fail if 
historyserver is not yet started
/opt/hadoop/sbin/stop-yarn.sh
/opt/hadoop/sbin/start-yarn.sh
/opt/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver


It may be slightly different for you if the resource manager and the history 
server are not on the same machine.

Hope it will work for you as well!
Christophe.

On 24/02/2015 06:31, Colin Kincaid Williams wrote:
 Hi,

 I have been trying to get my yarn logs to display in the spark history-server 
 or yarn history-server. I can see the log information


 yarn logs -applicationId application_1424740955620_0009
 15/02/23 22:15:14 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
 to us3sm2hbqa04r07-comp-prod-local


 Container: container_1424740955620_0009_01_02 on 
 us3sm2hbqa07r07.comp.prod.local_8041
 ===
 LogType: stderr
 LogLength: 0
 Log Contents:

 LogType: stdout
 LogLength: 897
 Log Contents:
 [GC [PSYoungGen: 262656K-23808K(306176K)] 262656K-23880K

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-26 Thread Christophe Préaud
You can see this information in the yarn web UI using the configuration I 
provided in my former mail (click on the application id, then on logs; you will 
then be automatically redirected to the yarn history server UI).

On 24/02/2015 19:49, Colin Kincaid Williams wrote:
So back to my original question.

I can see the spark logs using the example above:

yarn logs -applicationId application_1424740955620_0009

This shows yarn log aggregation working. I can see the std out and std error in 
that container information above. Then how can I get this information in a 
web-ui ? Is this not currently supported?

On Tue, Feb 24, 2015 at 10:44 AM, Imran Rashid 
iras...@cloudera.com wrote:
the spark history server and the yarn history server are totally independent.  
Spark knows nothing about yarn logs, and vice versa, so unfortunately there 
isn't any way to get all the info in one place.

On Tue, Feb 24, 2015 at 12:36 PM, Colin Kincaid Williams 
disc...@uw.edu wrote:
Looks like in my tired state, I didn't mention spark the whole time. However, 
it might be implied by the application log above. Spark log aggregation appears 
to be working, since I can run the yarn command above. I do have yarn logging 
setup for the yarn history server. I was trying to use the spark 
history-server, but maybe I should try setting

spark.yarn.historyServer.address

to the yarn history-server, instead of the spark history-server? I tried this 
configuration when I started, but didn't have much luck.

Are you getting your spark apps run in yarn client or cluster mode in your yarn 
history server? If so can you share any spark settings?

On Tue, Feb 24, 2015 at 8:48 AM, Christophe Préaud 
christophe.pre...@kelkoo.com wrote:
Hi Colin,

Here is how I have configured my hadoop cluster to have yarn logs available 
through both the yarn CLI and the _yarn_ history server (with gzip compression 
and 10 days retention):

1. Add the following properties in the yarn-site.xml on each node managers and 
on the resource manager:
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>864000</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://dc1-kdp-dev-hadoop-03.dev.dc1.kelkoo.net:19888/jobhistory/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <value>gz</value>
  </property>

2. Restart yarn and then start the yarn history server on the server defined in 
the yarn.log.server.url property above:

/opt/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver # should fail if 
historyserver is not yet started
/opt/hadoop/sbin/stop-yarn.sh
/opt/hadoop/sbin/start-yarn.sh
/opt/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver


It may be slightly different for you if the resource manager and the history 
server are not on the same machine.

Hope it will work for you as well!
Christophe.

On 24/02/2015 06:31, Colin Kincaid Williams wrote:
 Hi,

 I have been trying to get my yarn logs to display in the spark history-server 
 or yarn history-server. I can see the log information


 yarn logs -applicationId application_1424740955620_0009
 15/02/23 22:15:14 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
 to us3sm2hbqa04r07-comp-prod-local


 Container: container_1424740955620_0009_01_02 on 
 us3sm2hbqa07r07.comp.prod.local_8041
 ===
 LogType: stderr
 LogLength: 0
 Log Contents:

 LogType: stdout
 LogLength: 897
 Log Contents:
 [GC [PSYoungGen: 262656K-23808K(306176K)] 262656K-23880K(1005568K), 
 0.0283450 secs] [Times: user=0.14 sys=0.03, real=0.03 secs]
 Heap
  PSYoungGen  total 306176K, used 111279K [0xeaa8, 
 0x0001, 0x0001)
   eden space 262656K, 33% used 
 [0xeaa8,0xeffebbe0,0xfab0)
   from space 43520K, 54% used 
 [0xfab0,0xfc240320,0xfd58)
   to   space 43520K, 0% used 
 [0xfd58,0xfd58,0x0001)
  ParOldGen   total 699392K, used 72K [0xbff8, 
 0xeaa8, 0xeaa8)
   object space 699392K, 0% used 
 [0xbff8,0xbff92010,0xeaa8)
  PSPermGen   total 35328K, used 34892K [0xbad8, 
 0xbd00, 0xbff8)
   object space 35328K, 98% used 
 [0xbad8,0xbcf93088,0xbd00)



 Container: container_1424740955620_0009_01_03 on 
 us3sm2hbqa09r09.comp.prod.local_8041
 ===
 LogType: stderr
 LogLength: 0
 Log Contents:

 LogType: stdout
 LogLength: 896
 Log Contents:
 [GC [PSYoungGen: 262656K-23725K(306176K)] 262656K-23797K(1005568K), 
 0.0358650 secs

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Christophe Préaud
Hi Colin,

Here is how I have configured my hadoop cluster to have yarn logs available 
through both the yarn CLI and the _yarn_ history server (with gzip compression 
and 10 days retention):

1. Add the following properties in the yarn-site.xml on each node managers and 
on the resource manager:
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>864000</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://dc1-kdp-dev-hadoop-03.dev.dc1.kelkoo.net:19888/jobhistory/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <value>gz</value>
  </property>

2. Restart yarn and then start the yarn history server on the server defined in 
the yarn.log.server.url property above:

/opt/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver # should fail if 
historyserver is not yet started
/opt/hadoop/sbin/stop-yarn.sh
/opt/hadoop/sbin/start-yarn.sh
/opt/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver


It may be slightly different for you if the resource manager and the history 
server are not on the same machine.

Hope it will work for you as well!
Christophe.

On 24/02/2015 06:31, Colin Kincaid Williams wrote:
 Hi,

 I have been trying to get my yarn logs to display in the spark history-server 
 or yarn history-server. I can see the log information


 yarn logs -applicationId application_1424740955620_0009
 15/02/23 22:15:14 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
 to us3sm2hbqa04r07-comp-prod-local


 Container: container_1424740955620_0009_01_02 on 
 us3sm2hbqa07r07.comp.prod.local_8041
 ===
 LogType: stderr
 LogLength: 0
 Log Contents:

 LogType: stdout
 LogLength: 897
 Log Contents:
 [GC [PSYoungGen: 262656K-23808K(306176K)] 262656K-23880K(1005568K), 
 0.0283450 secs] [Times: user=0.14 sys=0.03, real=0.03 secs]
 Heap
  PSYoungGen  total 306176K, used 111279K [0xeaa8, 
 0x0001, 0x0001)
   eden space 262656K, 33% used 
 [0xeaa8,0xeffebbe0,0xfab0)
   from space 43520K, 54% used 
 [0xfab0,0xfc240320,0xfd58)
   to   space 43520K, 0% used 
 [0xfd58,0xfd58,0x0001)
  ParOldGen   total 699392K, used 72K [0xbff8, 
 0xeaa8, 0xeaa8)
   object space 699392K, 0% used 
 [0xbff8,0xbff92010,0xeaa8)
  PSPermGen   total 35328K, used 34892K [0xbad8, 
 0xbd00, 0xbff8)
   object space 35328K, 98% used 
 [0xbad8,0xbcf93088,0xbd00)



 Container: container_1424740955620_0009_01_03 on 
 us3sm2hbqa09r09.comp.prod.local_8041
 ===
 LogType: stderr
 LogLength: 0
 Log Contents:

 LogType: stdout
 LogLength: 896
 Log Contents:
 [GC [PSYoungGen: 262656K-23725K(306176K)] 262656K-23797K(1005568K), 
 0.0358650 secs] [Times: user=0.28 sys=0.04, real=0.04 secs]
 Heap
  PSYoungGen  total 306176K, used 65712K [0xeaa8, 
 0x0001, 0x0001)
   eden space 262656K, 15% used 
 [0xeaa8,0xed380bf8,0xfab0)
   from space 43520K, 54% used 
 [0xfab0,0xfc22b4f8,0xfd58)
   to   space 43520K, 0% used 
 [0xfd58,0xfd58,0x0001)
  ParOldGen   total 699392K, used 72K [0xbff8, 
 0xeaa8, 0xeaa8)
   object space 699392K, 0% used 
 [0xbff8,0xbff92010,0xeaa8)
  PSPermGen   total 29696K, used 29486K [0xbad8, 
 0xbca8, 0xbff8)
   object space 29696K, 99% used 
 [0xbad8,0xbca4b838,0xbca8)



 Container: container_1424740955620_0009_01_01 on 
 us3sm2hbqa09r09.comp.prod.local_8041
 ===
 LogType: stderr
 LogLength: 0
 Log Contents:

 LogType: stdout
 LogLength: 21
 Log Contents:
 Pi is roughly 3.1416

 I can see some details for the application in the spark history-server at 
 this url 
 http://us3sm2hbqa04r07.comp.prod.local:18080/history/application_1424740955620_0009/jobs/
  . When running in spark-master mode, I can see the stdout and stderror 
 somewhere in the spark history-server. Then how do I get the information 
 which I see above into the Spark history-server ?


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de 

Re: running spark project using java -cp command

2015-02-13 Thread Christophe Préaud
You can also export the variable SPARK_PRINT_LAUNCH_COMMAND before launching a 
spark-submit command to display the java command that will be launched, e.g.:

export SPARK_PRINT_LAUNCH_COMMAND=1
/opt/spark/bin/spark-submit --master yarn --deploy-mode cluster --class 
kelkoo.SparkAppTemplate --jars 
hdfs://prod-cluster/user/preaudc/jars/apps/joda-convert-1.6.jar,hdfs://prod-cluster/user/preaudc/jars/apps/joda-time-2.3.jar,hdfs://prod-cluster/user/preaudc/jars/apps/logReader-1.0.22.jar
 --driver-memory 512M --driver-library-path /opt/hadoop/lib/native 
--driver-class-path /usr/share/java/mysql-connector-java.jar --executor-memory 
1G --executor-cores 1 --queue spark-batch --num-executors 2 
hdfs://prod-cluster/user/preaudc/jars/apps/logProcessing-1.0.10.jar --log_dir 
/user/kookel/logs --country fr a b c
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/lib/jvm/java-openjdk/bin/java -cp 
:/usr/share/java/mysql-connector-java.jar:/opt/spark/conf:/opt/spark/lib/spark-assembly-hadoop.jar:/opt/spark/lib/datanucleus-api-jdo-3.2.1.jar:/opt/spark/lib/datanucleus-core-3.2.2.jar:/opt/spark/lib/datanucleus-rdbms-3.2.1.jar:/etc/hadoop:/etc/hadoop
 -XX:MaxPermSize=128m -Djava.library.path=/opt/hadoop/lib/native -Xms512M 
-Xmx512M org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode 
cluster --class kelkoo.SparkAppTemplate --jars 
hdfs://prod-cluster/user/preaudc/jars/apps/joda-convert-1.6.jar,hdfs://prod-cluster/user/preaudc/jars/apps/joda-time-2.3.jar,hdfs://prod-cluster/user/preaudc/jars/apps/logReader-1.0.22.jar
 --driver-memory 512M --driver-library-path /opt/hadoop/lib/native 
--driver-class-path /usr/share/java/mysql-connector-java.jar --executor-memory 
1G --executor-cores 1 --queue spark-batch --num-executors 2 
hdfs://prod-cluster/user/preaudc/jars/apps/logProcessing-1.0.10.jar --log_dir 
/user
/kookel/logs --country fr a b c

(...)



Christophe.

On 10/02/2015 07:26, Akhil Das wrote:
Yes like this:

/usr/lib/jvm/java-7-openjdk-i386/bin/java -cp 
::/home/akhld/mobi/localcluster/spark-1/conf:/home/akhld/mobi/localcluster/spark-1/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-core-3.2.2.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-rdbms-3.2.1.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-api-jdo-3.2.1.jar
 -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit 
--class org.apache.spark.repl.Main spark-shell

It launches spark-shell.


Thanks
Best Regards

On Tue, Feb 10, 2015 at 11:36 AM, Hafiz Mujadid 
hafizmujadi...@gmail.com wrote:
hi experts!

Is there any way to run spark application using java -cp command ?


thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/running-spark-project-using-java-cp-command-tp21567.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: How to set hadoop native library path in spark-1.1

2014-10-23 Thread Christophe Préaud
Hi,

Try the --driver-library-path option of spark-submit, e.g.:

/opt/spark/bin/spark-submit --driver-library-path /opt/hadoop/lib/native (...)
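
The same thing can also be expressed through the corresponding configuration
properties; this is only a sketch to name the properties (the path is just the
one from the example above):

import org.apache.spark.SparkConf

// Sketch: native library path for driver and executors. In practice these are
// usually passed via --conf or spark-defaults.conf rather than set in code,
// since the driver JVM is already running by the time this code executes.
val conf = new SparkConf()
  .set("spark.driver.extraLibraryPath", "/opt/hadoop/lib/native")
  .set("spark.executor.extraLibraryPath", "/opt/hadoop/lib/native")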

Regards,
Christophe.

On 21/10/2014 20:44, Pradeep Ch wrote:
 Hi all,

 Can anyone tell me how to set the native library path in Spark.

 Right now I am setting it using the SPARK_LIBRARY_PATH environment variable 
 in spark-env.sh. But still no success.

 I am still seeing this in spark-shell.

 NativeCodeLoader: Unable to load native-hadoop library for your platform... 
 using builtin-java classes where applicable


 Thanks,
 Pradeep


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark can't find jars

2014-10-16 Thread Christophe Préaud
Hi,

I have created a JIRA
(SPARK-3967: https://issues.apache.org/jira/browse/SPARK-3967); can you please
confirm that you are hit by the same issue?

Thanks,
Christophe.

On 15/10/2014 09:49, Christophe Préaud wrote:
Hi Jimmy,
Did you try my patch?
The problem on my side was that the hadoop.tmp.dir  (in hadoop core-site.xml) 
was not handled properly by Spark when it is set on multiple partitions/disks, 
i.e.:

<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/d1/yarn/local,file:/d2/yarn/local,file:/d3/yarn/local,file:/d4/yarn/local,file:/d5/yarn/local,file:/d6/yarn/local,file:/d7/yarn/local</value>
</property>

Hence, you won't be hit by this bug if your hadoop.tmp.dir is set on one 
partition only.
If your hadoop.tmp.dir is also set on several partitions, I agree that it looks 
like a bug in Spark.

Christophe.

On 14/10/2014 18:50, Jimmy McErlain wrote:
So the only way that I could make this work was to build a fat jar file as 
suggested earlier.  To me (and I am no expert) it seems like this is a bug.  
Everything was working for me prior to our upgrade to Spark 1.1 on Hadoop 2.2 
but now it seems to not...  ie packaging my jars locally then pushing them out 
to the cluster and pointing them to corresponding dependent jars

Sorry I cannot be more help!
J





JIMMY MCERLAIN

DATA SCIENTIST (NERD)

. . . . . . . . . . . . . . . . . .


IF WE CAN’T DOUBLE YOUR SALES,

ONE OF US IS IN THE WRONG BUSINESS.


E: ji...@sellpoints.com

M: 510.303.7751

On Tue, Oct 14, 2014 at 4:59 AM, Christophe Préaud 
christophe.pre...@kelkoo.com wrote:
Hello,

I have already posted a message with the exact same problem, and proposed a 
patch (the subject is Application failure in yarn-cluster mode).
Can you test it, and see if it works for you?
I would be glad too if someone can confirm that it is a bug in Spark 1.1.0.

Regards,
Christophe.


On 14/10/2014 03:15, Jimmy McErlain wrote:
BTW this has always worked for me before until we upgraded the cluster to Spark 
1.1.1...
J





JIMMY MCERLAIN

DATA SCIENTIST (NERD)

. . . . . . . . . . . . . . . . . .


IF WE CAN’T DOUBLE YOUR SALES,

ONE OF US IS IN THE WRONG BUSINESS.


E: ji...@sellpoints.com

M: 510.303.7751tel:510.303.7751

On Mon, Oct 13, 2014 at 5:39 PM, HARIPRIYA AYYALASOMAYAJULA 
aharipriy...@gmail.com wrote:
Helo,

Can you check if  the jar file is available in the target-scala-2.10 folder?

When you use sbt package to make the jar file, that is where the jar file would 
be located.

The following command works well for me:


spark-submit --class "Classname" --master yarn-cluster jarfile (with complete
path)

Can you try checking  with this initially and later add other options?

On Mon, Oct 13, 2014 at 7:36 PM, Jimmy 
ji...@sellpoints.com wrote:
Having the exact same error with the exact same jar Do you work for 
Altiscale? :)
J

Sent from my iPhone

On Oct 13, 2014, at 5:33 PM, Andy Srine 
andy.sr...@gmail.com wrote:


Hi Guys,


Spark rookie here. I am getting a file not found exception on the --jars. This 
is on the yarn cluster mode and I am running the following command on our 
recently upgraded Spark 1.1.1 environment.


./bin/spark-submit --verbose --master yarn --deploy-mode cluster --class 
myEngine --driver-memory 1g --driver-library-path 
/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201406111750.jar 
--executor-memory 5g --executor-cores 5 --jars 
/home/andy/spark/lib/joda-convert-1.2.jar --queue default --num-executors 4 
/home/andy/spark/lib/my-spark-lib_1.0.jar


This is the error I am hitting. Any tips would be much appreciated. The file 
permissions looks fine on my local disk.


14/10/13 22:49:39 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with FAILED

14/10/13 22:49:39 INFO impl.AMRMClientImpl: Waiting for application to be 
successfully unregistered.

Exception in thread Driver java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure

Re: Application failure in yarn-cluster mode

2014-10-16 Thread Christophe Préaud
Hi,

I have been able to reproduce this problem on our dev environment; I am fairly
sure now that it is indeed a bug.
As a consequence, I have created a JIRA
(SPARK-3967: https://issues.apache.org/jira/browse/SPARK-3967) for this issue,
which is triggered when yarn.nodemanager.local-dirs (not hadoop.tmp.dir, as I
said below) is set to a comma-separated list of directories which are located
on different disks/partitions.

Regards,
Christophe.

On 14/10/2014 09:37, Christophe Préaud wrote:
Hi,

Sorry to insist, but I really feel like the problem described below is a bug in 
Spark.
Can anybody confirm if it is a bug, or a (configuration?) problem on my side?

Thanks,
Christophe.

On 10/10/2014 18:24, Christophe Préaud wrote:
Hi,

After updating from spark-1.0.0 to spark-1.1.0, my spark applications failed 
most of the time (but not always) in yarn-cluster mode (but not in yarn-client 
mode).

Here is my configuration:

 *   spark-1.1.0
 *   hadoop-2.2.0

And the hadoop.tmp.dir definition in the hadoop core-site.xml file (each 
directory is on its own partition, on different disks):
<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/d1/yarn/local,file:/d2/yarn/local,file:/d3/yarn/local,file:/d4/yarn/local,file:/d5/yarn/local,file:/d6/yarn/local,file:/d7/yarn/local</value>
</property>

After investigating, it turns out that the problem occurs when the executor fetches
a jar file: the jar is downloaded to a temporary file, always in /d1/yarn/local
(see the hadoop.tmp.dir definition above), and then moved to one of the temporary
directories defined in hadoop.tmp.dir:

 *   if it is the same directory as the temporary file (i.e. /d1/yarn/local), then the
application continues normally
 *   if it is another one (i.e. /d2/yarn/local, /d3/yarn/local,...), it fails
with the following error:

14/10/10 14:33:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 0)
java.io.FileNotFoundException: ./logReader-1.0.10.jar (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:221)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
at com.google.common.io.Files.copy(Files.java:436)
at com.google.common.io.Files.move(Files.java:651)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:440)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:325)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:323)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:323)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I have no idea why the move fails when the source and target files are not on
the same partition; for the moment I have worked around the problem with the
attached patch (i.e. I ensure that the temp file and the moved file are always
on the same partition).

Any thought about this problem?

Thanks!
Christophe.


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.



Kelkoo SAS
Société par Actions Simplifiée
Au capital de

Re: Spark can't find jars

2014-10-15 Thread Christophe Préaud
Hi Jimmy,
Did you try my patch?
The problem on my side was that the hadoop.tmp.dir  (in hadoop core-site.xml) 
was not handled properly by Spark when it is set on multiple partitions/disks, 
i.e.:

<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/d1/yarn/local,file:/d2/yarn/local,file:/d3/yarn/local,file:/d4/yarn/local,file:/d5/yarn/local,file:/d6/yarn/local,file:/d7/yarn/local</value>
</property>

Hence, you won't be hit by this bug if your hadoop.tmp.dir is set on one 
partition only.
If your hadoop.tmp.dir is also set on several partitions, I agree that it looks 
like a bug in Spark.

Christophe.

On 14/10/2014 18:50, Jimmy McErlain wrote:
So the only way that I could make this work was to build a fat jar file as 
suggested earlier.  To me (and I am no expert) it seems like this is a bug.  
Everything was working for me prior to our upgrade to Spark 1.1 on Hadoop 2.2 
but now it seems to not...  ie packaging my jars locally then pushing them out 
to the cluster and pointing them to corresponding dependent jars

Sorry I cannot be more help!
J





JIMMY MCERLAIN

DATA SCIENTIST (NERD)

. . . . . . . . . . . . . . . . . .


IF WE CAN’T DOUBLE YOUR SALES,

ONE OF US IS IN THE WRONG BUSINESS.


E: ji...@sellpoints.com

M: 510.303.7751

On Tue, Oct 14, 2014 at 4:59 AM, Christophe Préaud 
christophe.pre...@kelkoo.com wrote:
Hello,

I have already posted a message with the exact same problem, and proposed a 
patch (the subject is Application failure in yarn-cluster mode).
Can you test it, and see if it works for you?
I would be glad too if someone can confirm that it is a bug in Spark 1.1.0.

Regards,
Christophe.


On 14/10/2014 03:15, Jimmy McErlain wrote:
BTW this has always worked for me before until we upgraded the cluster to Spark 
1.1.1...
J





JIMMY MCERLAIN

DATA SCIENTIST (NERD)

. . . . . . . . . . . . . . . . . .


IF WE CAN’T DOUBLE YOUR SALES,

ONE OF US IS IN THE WRONG BUSINESS.


E: ji...@sellpoints.com

M: 510.303.7751tel:510.303.7751

On Mon, Oct 13, 2014 at 5:39 PM, HARIPRIYA AYYALASOMAYAJULA 
aharipriy...@gmail.com wrote:
Helo,

Can you check if  the jar file is available in the target-scala-2.10 folder?

When you use sbt package to make the jar file, that is where the jar file would 
be located.

The following command works well for me:


spark-submit --class "Classname" --master yarn-cluster jarfile (with complete
path)

Can you try checking  with this initially and later add other options?

On Mon, Oct 13, 2014 at 7:36 PM, Jimmy 
ji...@sellpoints.com wrote:
Having the exact same error with the exact same jar Do you work for 
Altiscale? :)
J

Sent from my iPhone

On Oct 13, 2014, at 5:33 PM, Andy Srine 
andy.sr...@gmail.com wrote:


Hi Guys,


Spark rookie here. I am getting a file not found exception on the --jars. This 
is on the yarn cluster mode and I am running the following command on our 
recently upgraded Spark 1.1.1 environment.


./bin/spark-submit --verbose --master yarn --deploy-mode cluster --class 
myEngine --driver-memory 1g --driver-library-path 
/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201406111750.jar 
--executor-memory 5g --executor-cores 5 --jars 
/home/andy/spark/lib/joda-convert-1.2.jar --queue default --num-executors 4 
/home/andy/spark/lib/my-spark-lib_1.0.jar


This is the error I am hitting. Any tips would be much appreciated. The file 
permissions looks fine on my local disk.


14/10/13 22:49:39 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with FAILED

14/10/13 22:49:39 INFO impl.AMRMClientImpl: Waiting for application to be 
successfully unregistered.

Exception in thread Driver java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 
1.0 (TID 12, 122-67.vb2.company.comhttp://122-67.vb2.company.com): 
java.io.FileNotFoundException: ./joda-convert-1.2.jar (Permission

Re: Application failure in yarn-cluster mode

2014-10-14 Thread Christophe Préaud
Hi,

Sorry to insist, but I really feel like the problem described below is a bug in 
Spark.
Can anybody confirm if it is a bug, or a (configuration?) problem on my side?

Thanks,
Christophe.

On 10/10/2014 18:24, Christophe Préaud wrote:
Hi,

After updating from spark-1.0.0 to spark-1.1.0, my spark applications failed 
most of the time (but not always) in yarn-cluster mode (but not in yarn-client 
mode).

Here is my configuration:

 *   spark-1.1.0
 *   hadoop-2.2.0

And the hadoop.tmp.dir definition in the hadoop core-site.xml file (each 
directory is on its own partition, on different disks):
<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/d1/yarn/local,file:/d2/yarn/local,file:/d3/yarn/local,file:/d4/yarn/local,file:/d5/yarn/local,file:/d6/yarn/local,file:/d7/yarn/local</value>
</property>

After investigating, it turns out that the problem occurs when the executor fetches
a jar file: the jar is downloaded to a temporary file, always in /d1/yarn/local
(see the hadoop.tmp.dir definition above), and then moved to one of the temporary
directories defined in hadoop.tmp.dir:

 *   if it is the same directory as the temporary file (i.e. /d1/yarn/local), then the
application continues normally
 *   if it is another one (i.e. /d2/yarn/local, /d3/yarn/local,...), it fails
with the following error:

14/10/10 14:33:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 0)
java.io.FileNotFoundException: ./logReader-1.0.10.jar (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:221)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
at com.google.common.io.Files.copy(Files.java:436)
at com.google.common.io.Files.move(Files.java:651)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:440)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:325)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:323)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:323)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I have no idea why the move fails when the source and target files are not on
the same partition; for the moment I have worked around the problem with the
attached patch (i.e. I ensure that the temp file and the moved file are always
on the same partition).

Any thought about this problem?

Thanks!
Christophe.


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Spark can't find jars

2014-10-14 Thread Christophe Préaud
Hello,

I have already posted a message with the exact same problem, and proposed a 
patch (the subject is Application failure in yarn-cluster mode).
Can you test it, and see if it works for you?
I would be glad too if someone can confirm that it is a bug in Spark 1.1.0.

Regards,
Christophe.

On 14/10/2014 03:15, Jimmy McErlain wrote:
BTW this has always worked for me before until we upgraded the cluster to Spark 
1.1.1...
J





JIMMY MCERLAIN

DATA SCIENTIST (NERD)

. . . . . . . . . . . . . . . . . .


IF WE CAN’T DOUBLE YOUR SALES,

ONE OF US IS IN THE WRONG BUSINESS.


E: ji...@sellpoints.com

M: 510.303.7751

On Mon, Oct 13, 2014 at 5:39 PM, HARIPRIYA AYYALASOMAYAJULA 
aharipriy...@gmail.com wrote:
Helo,

Can you check if  the jar file is available in the target-scala-2.10 folder?

When you use sbt package to make the jar file, that is where the jar file would 
be located.

The following command works well for me:


spark-submit --class "Classname" --master yarn-cluster jarfile (with complete
path)

Can you try checking  with this initially and later add other options?

On Mon, Oct 13, 2014 at 7:36 PM, Jimmy 
ji...@sellpoints.com wrote:
Having the exact same error with the exact same jar Do you work for 
Altiscale? :)
J

Sent from my iPhone

On Oct 13, 2014, at 5:33 PM, Andy Srine 
andy.sr...@gmail.com wrote:


Hi Guys,


Spark rookie here. I am getting a file not found exception on the --jars. This 
is on the yarn cluster mode and I am running the following command on our 
recently upgraded Spark 1.1.1 environment.


./bin/spark-submit --verbose --master yarn --deploy-mode cluster --class 
myEngine --driver-memory 1g --driver-library-path 
/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201406111750.jar 
--executor-memory 5g --executor-cores 5 --jars 
/home/andy/spark/lib/joda-convert-1.2.jar --queue default --num-executors 4 
/home/andy/spark/lib/my-spark-lib_1.0.jar


This is the error I am hitting. Any tips would be much appreciated. The file 
permissions looks fine on my local disk.


14/10/13 22:49:39 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with FAILED

14/10/13 22:49:39 INFO impl.AMRMClientImpl: Waiting for application to be 
successfully unregistered.

Exception in thread Driver java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 
1.0 (TID 12, 122-67.vb2.company.comhttp://122-67.vb2.company.com): 
java.io.FileNotFoundException: ./joda-convert-1.2.jar (Permission denied)

java.io.FileOutputStream.open(Native Method)

java.io.FileOutputStream.init(FileOutputStream.java:221)

com.google.common.io.Files$FileByteSink.openStream(Files.java:223)

com.google.common.io.Files$FileByteSink.openStream(Files.java:211)



Thanks,
Andy




--
Regards,
Haripriya Ayyalasomayajula







Application failure in yarn-cluster mode

2014-10-10 Thread Christophe Préaud
Hi,

After updating from spark-1.0.0 to spark-1.1.0, my spark applications failed 
most of the time (but not always) in yarn-cluster mode (but not in yarn-client 
mode).

Here is my configuration:

 *   spark-1.1.0
 *   hadoop-2.2.0

And the hadoop.tmp.dir definition in the hadoop core-site.xml file (each 
directory is on its own partition):
<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/d1/yarn/local,file:/d2/yarn/local,file:/d3/yarn/local,file:/d4/yarn/local,file:/d5/yarn/local,file:/d6/yarn/local,file:/d7/yarn/local</value>
</property>

After investigating, it turns out that the problem occurs when the executor fetches 
a jar file: the jar is downloaded to a temporary file, always in /d1/yarn/local 
(see the hadoop.tmp.dir definition above), and then moved to one of the temporary 
directories defined in hadoop.tmp.dir:

 *   if it is the same directory as the temporary file (i.e. /d1/yarn/local), the 
application continues normally
 *   if it is another one (i.e. /d2/yarn/local, /d3/yarn/local,...), it fails 
with the following error:

14/10/10 14:33:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 0)
java.io.FileNotFoundException: ./logReader-1.0.10.jar (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
at com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
at com.google.common.io.Files.copy(Files.java:436)
at com.google.common.io.Files.move(Files.java:651)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:440)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:325)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:323)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:323)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I have no idea why the move fails when the source and target files are not on 
the same partition; for the moment I have worked around the problem with the 
attached patch (i.e. I ensure that the temporary file and the moved file are 
always on the same partition).
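
In isolation, the idea of the attached patch looks roughly like the sketch below 
(illustrative only, not the actual Spark code): create the temporary download 
file directly in the target directory, so that the final move never crosses a 
mount point.

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Create the temp file next to the final target so the subsequent move stays
// on the same partition and is effectively a rename.
def fetchToDir(targetDir: File, filename: String)(download: File => Unit): File = {
  val tempFile = File.createTempFile("fetchFileTemp", null, targetDir)
  download(tempFile)  // write the payload into the temp file
  val targetFile = new File(targetDir, filename)
  Files.move(tempFile.toPath, targetFile.toPath, StandardCopyOption.REPLACE_EXISTING)
  targetFile
}

When source and target are on the same filesystem, the move is a plain rename, 
which avoids the copy path that fails above.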

Any thoughts about this problem?

Thanks!
Christophe.


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.
--- core/src/main/scala/org/apache/spark/util/Utils.scala.orig	2014-09-03 08:00:33.0 +0200
+++ core/src/main/scala/org/apache/spark/util/Utils.scala	2014-10-10 17:51:59.0 +0200
@@ -349,8 +349,7 @@
*/
   def fetchFile(url: String, targetDir: File, conf: SparkConf, securityMgr: SecurityManager) {
 val filename = url.split("/").last
-val tempDir = getLocalDir(conf)
-val tempFile =  File.createTempFile("fetchFileTemp", null, new File(tempDir))
+val tempFile =  File.createTempFile("fetchFileTemp", null, new File(targetDir.getAbsolutePath))
 val targetFile = new File(targetDir, filename)
 val uri = new URI(url)
 val fileOverwrite = conf.getBoolean("spark.files.overwrite", false)



Re: write event logs with YARN

2014-07-04 Thread Christophe Préaud
Hi Andrew,

Thanks for your explanation; I confirm that the entries show up in the history 
server UI when I create empty APPLICATION_COMPLETE files for each of them.
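
For the record, a minimal sketch of creating such an empty marker with the Hadoop 
FileSystem API (the path below is just an example):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Create the empty APPLICATION_COMPLETE file that the history server expects
// in a finished application's event log directory (example path).
val appDir = new Path("hdfs://server_name:9000/user/user_name/spark-events/app-1403166516102")
val fs = FileSystem.get(appDir.toUri, new Configuration())
val marker = new Path(appDir, "APPLICATION_COMPLETE")
if (!fs.exists(marker)) fs.create(marker).close()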

Christophe.

On 03/07/2014 18:27, Andrew Or wrote:
Hi Christophe, another Andrew speaking.

Your configuration looks fine to me. From the stack trace it seems that we are 
in fact closing the file system prematurely elsewhere in the system, such that 
when it tries to write the APPLICATION_COMPLETE file it throws the exception 
you see. This does look like a potential bug in Spark. Tracing the source of 
this may take a little, but we will start looking into it.

I'm assuming if you manually create your own APPLICATION_COMPLETE file then the 
entries should show up. Unfortunately I don't see another workaround for this, 
but we'll fix this as soon as possible.

Andrew


2014-07-03 1:44 GMT-07:00 Christophe Préaud 
christophe.pre...@kelkoo.com:
Hi Andrew,

This does not work (the application failed); I have the following error when I 
put 3 slashes in the hdfs scheme:
(...)
Caused by: java.lang.IllegalArgumentException: Pathname 
/dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
 from 
hdfs:/dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
 is not a valid DFS filename.
(...)

Besides, I do not think that there is an issue with the hdfs path name since 
only the empty APPLICATION_COMPLETE file is missing (with 
spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events), 
all other files are correctly created, e.g.:
hdfs dfs -ls spark-events/kelkoo.searchkeywordreport-1404376178470
Found 3 items
-rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29 
spark-events/kelkoo.searchkeywordreport-1404376178470/COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
-rwxrwx---   1 kookel supergroup 137948 2014-07-03 08:32 
spark-events/kelkoo.searchkeywordreport-1404376178470/EVENT_LOG_2
-rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29 
spark-events/kelkoo.searchkeywordreport-1404376178470/SPARK_VERSION_1.0.0

Your help is appreciated though; do not hesitate if you have any other ideas on 
how to fix this.

Thanks,
Christophe.


On 03/07/2014 01:49, Andrew Lee wrote:
Hi Christophe,

Make sure you have 3 slashes in the hdfs scheme.

e.g.

hdfs:///server_name:9000/user/user_name/spark-events

and in the spark-defaults.conf as well.
spark.eventLog.dir=hdfs:///server_name:9000/user/user_name/spark-events


 Date: Thu, 19 Jun 2014 11:18:51 +0200
 From: christophe.pre...@kelkoo.com
 To: user@spark.apache.org
 Subject: write event logs with YARN

 Hi,

 I am trying to use the new Spark history server in 1.0.0 to view finished 
 applications (launched on YARN), without success so far.

 Here are the relevant configuration properties in my spark-defaults.conf:

 spark.yarn.historyServer.address=server_name:18080
 spark.ui.killEnabled=false
 spark.eventLog.enabled=true
 spark.eventLog.compress=true
 spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events

 And the history server has been launched with the command below:

 /opt/spark/sbin/start-history-server.sh 
 hdfs://server_name:9000/user/user_name/spark-events


 However, the finished applications do not appear in the history server UI 
 (though the UI itself works correctly).
 Apparently, the problem is that the APPLICATION_COMPLETE file is not created:

 hdfs dfs -stat %n spark-events/application_name-1403166516102/*
 COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
 EVENT_LOG_2
 SPARK_VERSION_1.0.0

 Indeed, if I manually create an empty APPLICATION_COMPLETE file in the above 
 directory, the application can now be viewed normally in the history server.

 Finally, here is the relevant part of the YARN application log, which seems 
 to imply that
 the DFS Filesystem is already closed when the APPLICATION_COMPLETE file is 
 created:

 (...)
 14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with 
 SUCCEEDED
 14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be 
 successfully unregistered.
 14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal.
 14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory 
 .sparkStaging/application_1397477394591_0798
 14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from 
 shutdown hook
 14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at 
 http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877
 14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler
 14/06/19 08:29:29 INFO

Re: write event logs with YARN

2014-07-03 Thread Christophe Préaud
Hi Andrew,

This does not work (the application failed); I have the following error when I 
put 3 slashes in the hdfs scheme:
(...)
Caused by: java.lang.IllegalArgumentException: Pathname 
/dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
 from 
hdfs:/dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442
 is not a valid DFS filename.
(...)

Besides, I do not think that there is an issue with the hdfs path name since 
only the empty APPLICATION_COMPLETE file is missing (with 
spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events), 
all other files are correctly created, e.g.:
hdfs dfs -ls spark-events/kelkoo.searchkeywordreport-1404376178470
Found 3 items
-rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29 
spark-events/kelkoo.searchkeywordreport-1404376178470/COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
-rwxrwx---   1 kookel supergroup 137948 2014-07-03 08:32 
spark-events/kelkoo.searchkeywordreport-1404376178470/EVENT_LOG_2
-rwxrwx---   1 kookel supergroup  0 2014-07-03 08:29 
spark-events/kelkoo.searchkeywordreport-1404376178470/SPARK_VERSION_1.0.0

Your help is appreciated though; do not hesitate if you have any other ideas on 
how to fix this.

Thanks,
Christophe.

On 03/07/2014 01:49, Andrew Lee wrote:
Hi Christophe,

Make sure you have 3 slashes in the hdfs scheme.

e.g.

hdfs:///server_name:9000/user/user_name/spark-events

and in the spark-defaults.conf as well.
spark.eventLog.dir=hdfs:///server_name:9000/user/user_name/spark-events


 Date: Thu, 19 Jun 2014 11:18:51 +0200
 From: christophe.pre...@kelkoo.com
 To: user@spark.apache.org
 Subject: write event logs with YARN

 Hi,

 I am trying to use the new Spark history server in 1.0.0 to view finished 
 applications (launched on YARN), without success so far.

 Here are the relevant configuration properties in my spark-defaults.conf:

 spark.yarn.historyServer.address=server_name:18080
 spark.ui.killEnabled=false
 spark.eventLog.enabled=true
 spark.eventLog.compress=true
 spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events

 And the history server has been launched with the command below:

 /opt/spark/sbin/start-history-server.sh 
 hdfs://server_name:9000/user/user_name/spark-events


 However, the finished applications do not appear in the history server UI 
 (though the UI itself works correctly).
 Apparently, the problem is that the APPLICATION_COMPLETE file is not created:

 hdfs dfs -stat %n spark-events/application_name-1403166516102/*
 COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
 EVENT_LOG_2
 SPARK_VERSION_1.0.0

 Indeed, if I manually create an empty APPLICATION_COMPLETE file in the above 
 directory, the application can now be viewed normally in the history server.

 Finally, here is the relevant part of the YARN application log, which seems 
 to imply that
 the DFS Filesystem is already closed when the APPLICATION_COMPLETE file is 
 created:

 (...)
 14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with 
 SUCCEEDED
 14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be 
 successfully unregistered.
 14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal.
 14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory 
 .sparkStaging/application_1397477394591_0798
 14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from 
 shutdown hook
 14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at 
 http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877
 14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler
 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Shutting down all 
 executors
 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Asking each executor to 
 shut down
 14/06/19 08:29:30 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
 stopped!
 14/06/19 08:29:30 INFO ConnectionManager: Selector thread was interrupted!
 14/06/19 08:29:30 INFO ConnectionManager: ConnectionManager stopped
 14/06/19 08:29:30 INFO MemoryStore: MemoryStore cleared
 14/06/19 08:29:30 INFO BlockManager: BlockManager stopped
 14/06/19 08:29:30 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
 14/06/19 08:29:30 INFO BlockManagerMaster: BlockManagerMaster stopped
 Exception in thread "Thread-44" java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
 at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1365)
 at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1307)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:384)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:380)
 at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at 
 

Re: broadcast not working in yarn-cluster mode

2014-06-24 Thread Christophe Préaud
Hi again,

I've finally solved the problem below: it was due to an old 1.0.0-rc3 Spark jar 
lying around in my .m2 directory, which was used when I compiled my Spark 
applications (with Maven).
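
For anyone chasing a similar NoSuchMethodError: it will not tell you what you 
compiled against, but a quick check of which jar a class is actually loaded from 
at runtime helps rule out a stale artifact on the classpath (illustrative 
snippet, not specific to my job):

// Print the jar that provides SparkContext at runtime.
val location = classOf[org.apache.spark.SparkContext]
  .getProtectionDomain.getCodeSource.getLocation
println("SparkContext loaded from: " + location)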

Christophe.

On 20/06/2014 18:13, Christophe Préaud wrote:
 Hi,

 Since I migrated to spark 1.0.0, a couple of applications that used to work 
 in 0.9.1 now fail when broadcasting a variable.
 Those applications are run on a YARN cluster in yarn-cluster mode (and used 
 to run in yarn-standalone mode in 0.9.1)

 Here is an extract of the error log:

 Exception in thread "Thread-3" java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.SparkContext.broadcast(Ljava/lang/Object;)Lorg/apache/spark/broadcast/Broadcast;
 at 
 kelkoo.MerchantOffersPerformance$.main(MerchantOffersPerformance.scala:289)
 at 
 kelkoo.MerchantOffersPerformance.main(MerchantOffersPerformance.scala)

 Has anyone any idea how to solve this problem?

 Thanks,
 Christophe.





broadcast not working in yarn-cluster mode

2014-06-20 Thread Christophe Préaud
Hi,

Since I migrated to spark 1.0.0, a couple of applications that used to work in 
0.9.1 now fail when broadcasting a variable.
Those applications are run on a YARN cluster in yarn-cluster mode (and used to 
run in yarn-standalone mode in 0.9.1)
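
The failing call is an ordinary broadcast, essentially something like the snippet 
below (a simplified example, not my actual code):

// Broadcast a small lookup table and use it inside a transformation; the
// sc.broadcast(...) line is where the NoSuchMethodError is thrown.
val lookup = Map("a" -> 1, "b" -> 2)
val lookupBc = sc.broadcast(lookup)
val counts = sc.parallelize(Seq("a", "b", "c"))
  .map(k => lookupBc.value.getOrElse(k, 0))
  .collect()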

Here is an extract of the error log:

Exception in thread "Thread-3" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.SparkContext.broadcast(Ljava/lang/Object;)Lorg/apache/spark/broadcast/Broadcast;
at 
kelkoo.MerchantOffersPerformance$.main(MerchantOffersPerformance.scala:289)
at 
kelkoo.MerchantOffersPerformance.main(MerchantOffersPerformance.scala)

Has anyone any idea how to solve this problem?

Thanks,
Christophe.



write event logs with YARN

2014-06-19 Thread Christophe Préaud
Hi,

I am trying to use the new Spark history server in 1.0.0 to view finished 
applications (launched on YARN), without success so far.

Here are the relevant configuration properties in my spark-defaults.conf:

spark.yarn.historyServer.address=server_name:18080
spark.ui.killEnabled=false
spark.eventLog.enabled=true
spark.eventLog.compress=true
spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events

And the history server has been launched with the command below:

/opt/spark/sbin/start-history-server.sh 
hdfs://server_name:9000/user/user_name/spark-events


However, the finished applications do not appear in the history server UI 
(though the UI itself works correctly).
Apparently, the problem is that the APPLICATION_COMPLETE file is not created:

hdfs dfs -stat %n spark-events/application_name-1403166516102/*
COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
EVENT_LOG_2
SPARK_VERSION_1.0.0

Indeed, if I manually create an empty APPLICATION_COMPLETE file in the above 
directory, the application can now be viewed normally in the history server.

Finally, here is the relevant part of the YARN application log, which seems to 
imply that
the DFS Filesystem is already closed when the APPLICATION_COMPLETE file is 
created:

(...)
14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED
14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be 
successfully unregistered.
14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal.
14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory 
.sparkStaging/application_1397477394591_0798
14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from 
shutdown hook
14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at 
http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877
14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler
14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Shutting down all 
executors
14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Asking each executor to 
shut down
14/06/19 08:29:30 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
stopped!
14/06/19 08:29:30 INFO ConnectionManager: Selector thread was interrupted!
14/06/19 08:29:30 INFO ConnectionManager: ConnectionManager stopped
14/06/19 08:29:30 INFO MemoryStore: MemoryStore cleared
14/06/19 08:29:30 INFO BlockManager: BlockManager stopped
14/06/19 08:29:30 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
14/06/19 08:29:30 INFO BlockManagerMaster: BlockManagerMaster stopped
Exception in thread "Thread-44" java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1365)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1307)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:384)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:380)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:380)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:324)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
at org.apache.spark.util.FileLogger.createWriter(FileLogger.scala:117)
at org.apache.spark.util.FileLogger.newFile(FileLogger.scala:181)
at 
org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:129)
at 
org.apache.spark.SparkContext$$anonfun$stop$2.apply(SparkContext.scala:989)
at 
org.apache.spark.SparkContext$$anonfun$stop$2.apply(SparkContext.scala:989)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.stop(SparkContext.scala:989)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:443)
14/06/19 08:29:30 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.


Am I missing something, or is it a bug?

Thanks,
Christophe.



Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-24 Thread Christophe Préaud

Good to know, thanks for pointing this out to me!

On 23/04/2014 19:55, Sandy Ryza wrote:
Ah, you're right about SPARK_CLASSPATH and ADD_JARS.  My bad.

SPARK_YARN_APP_JAR is going away entirely - 
https://issues.apache.org/jira/browse/SPARK-1053


On Wed, Apr 23, 2014 at 8:07 AM, Christophe Préaud 
christophe.pre...@kelkoo.com wrote:
Hi Sandy,

Thanks for your reply !

I thought adding the jars in both SPARK_CLASSPATH and ADD_JARS was only 
required as a temporary workaround in spark 0.9.0 (see 
https://issues.apache.org/jira/browse/SPARK-1089), and that it was not 
necessary anymore in 0.9.1

As for SPARK_YARN_APP_JAR, is it really useful, or is it planned to be removed 
in future versions of Spark? I personally always set it to /dev/null when 
launching a spark-shell in yarn-client mode.

Thanks again for your time!
Christophe.


On 21/04/2014 19:16, Sandy Ryza wrote:
Hi Christophe,

Adding the jars to both SPARK_CLASSPATH and ADD_JARS is required.  The former 
makes them available to the spark-shell driver process, and the latter tells 
Spark to make them available to the executor processes running on the cluster.

-Sandy


On Wed, Apr 16, 2014 at 9:27 AM, Christophe Préaud 
christophe.pre...@kelkoo.com wrote:
Hi,

I am running Spark 0.9.1 on a YARN cluster, and I am wondering which is the
correct way to add external jars when running a spark shell on a YARN cluster.

Packaging all these dependencies in an assembly whose path is then set in
SPARK_YARN_APP_JAR (as written in the doc:
http://spark.apache.org/docs/latest/running-on-yarn.html) does not work in my
case: it pushes the jar to HDFS in .sparkStaging/application_XXX, but the
spark-shell is still unable to find it (unless ADD_JARS and/or SPARK_CLASSPATH
is defined)

Defining all the dependencies (either in an assembly, or separately) in ADD_JARS
or SPARK_CLASSPATH works (even if SPARK_YARN_APP_JAR is set to /dev/null), but
defining some dependencies in ADD_JARS and the rest in SPARK_CLASSPATH does not!

Hence I'm still wondering which are the differences between ADD_JARS and
SPARK_CLASSPATH, and the purpose of SPARK_YARN_APP_JAR.

Thanks for any insights!
Christophe.













SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-16 Thread Christophe Préaud

Hi,

I am running Spark 0.9.1 on a YARN cluster, and I am wondering which is the
correct way to add external jars when running a spark shell on a YARN cluster.

Packaging all these dependencies in an assembly whose path is then set in
SPARK_YARN_APP_JAR (as written in the doc:
http://spark.apache.org/docs/latest/running-on-yarn.html) does not work in my
case: it pushes the jar to HDFS in .sparkStaging/application_XXX, but the
spark-shell is still unable to find it (unless ADD_JARS and/or SPARK_CLASSPATH
is defined)

Defining all the dependencies (either in an assembly, or separately) in ADD_JARS
or SPARK_CLASSPATH works (even if SPARK_YARN_APP_JAR is set to /dev/null), but
defining some dependencies in ADD_JARS and the rest in SPARK_CLASSPATH does not!

Hence I'm still wondering which are the differences between ADD_JARS and
SPARK_CLASSPATH, and the purpose of SPARK_YARN_APP_JAR.
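
For what it's worth, here is a quick check that can be run inside the spark-shell
to see whether a given dependency is visible on the driver and on the executors
(the class name below is only an example):

import scala.util.Try

// Replace with a class that lives in your external jar.
val depClass = "org.joda.time.DateTime"

// Driver side: does the spark-shell process itself see the class?
val onDriver = Try(Class.forName(depClass)).isSuccess

// Executor side: run the same check inside tasks on the cluster.
val onExecutors = sc.parallelize(1 to 4, 4)
  .map(_ => Try(Class.forName(depClass)).isSuccess)
  .collect()

println("driver: " + onDriver + ", executors: " + onExecutors.mkString(","))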

Thanks for any insights!
Christophe.





Re: How to create a RPM package

2014-04-04 Thread Christophe Préaud

Hi Rahul,

Spark will be available in Fedora 21 (see: 
https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently 
scheduled for 2014-10-14, but they have already produced spec files and source 
RPMs.
If you are stuck with EL6 like me, you can have a look at the attached spec 
file, which you can probably adapt to your needs.

Christophe.

On 04/04/2014 09:10, Rahul Singhal wrote:
Hello Community,

This is my first mail to the list and I have a small question. The maven build 
page (http://spark.apache.org/docs/latest/building-with-maven.html#building-spark-debian-packages) 
mentions a way to create a debian package, but I was wondering if there is a simple 
way (preferably through maven) to create an RPM package. Is there a script (which is 
probably used for spark releases) that I can get my hands on? Or should I write one 
on my own?

P.S. I don't want to use the alien software to convert a debian package to an 
RPM.

Thanks,
Rahul Singhal



Name: spark
Version:  0.9.0

# Build time settings
%global _full_version %{version}-incubating
%global _final_name %{name}-%{_full_version}
%global _spark_hadoop_version 2.2.0
%global _spark_dir /opt

Release:  2
Summary:  Lightning-fast cluster computing
Group:Development/Libraries
License:  ASL 2.0
URL:  http://spark.apache.org/
Source0:  http://www.eu.apache.org/dist/incubator/spark/%{_final_name}/%{_final_name}.tgz
BuildRequires: git
Requires:  /bin/bash
Requires:  /bin/sh
Requires:  /usr/bin/env

%description
Apache Spark is a fast and general engine for large-scale data processing.


%prep
%setup -q -n %{_final_name}


%build
SPARK_HADOOP_VERSION=%{_spark_hadoop_version} SPARK_YARN=true ./sbt/sbt assembly
find bin -type f -name '*.cmd' -exec rm -f {} \;


%install
mkdir -p ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/{conf,jars}
echo "Spark %{_full_version} built for Hadoop %{_spark_hadoop_version}" > ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/RELEASE
cp assembly/target/scala*/spark-assembly-%{_full_version}-hadoop%{_spark_hadoop_version}.jar ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/jars/spark-assembly-hadoop.jar
cp conf/*.template ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/conf
cp -r bin ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}
cp -r python ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}
cp -r sbin ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}


%files
%defattr(-,root,root,-)
%{_spark_dir}/%{name}

%changelog
* Mon Mar 31 2014 Christophe Préaud <christophe.pre...@kelkoo.com> 0.9.0-2
- Use description and Summary from Fedora RPM

* Wed Mar 26 2014 Christophe Préaud <christophe.pre...@kelkoo.com> 0.9.0-1
- first version with changelog :-)