Create an external table with DataFrameWriterV2

2023-09-19 Thread Christophe Préaud
Hi, I usually create an external Delta table with the command below, using the DataFrameWriter API:

  df.write
    .format("delta")
    .option("path", "")
    .saveAsTable("")

Now I would like to use the DataFrameWriterV2 API. I have tried the following command:

  df.writeTo("")
    .using("delta")
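A minimal sketch of the V2 call I would expect, assuming Spark 3.x with a Delta catalog configured; the table name, the path, and the use of the "location" table property are assumptions, not a confirmed recipe:

  // Hypothetical sketch: create an external Delta table with DataFrameWriterV2.
  // Table name and path are placeholders; passing the external location as a
  // table property is an assumption about how the V2 CreateTableWriter maps it.
  df.writeTo("my_db.my_table")
    .using("delta")
    .tableProperty("location", "hdfs:///data/my_db/my_table")
    .create()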

Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-06 Thread Christophe Préaud
Hi Marc, I'm not very familiar with Spark on Java, but according to the doc, it should be: Encoder&lt;String&gt; stringEncoder = Encoders.STRING(); dataset.as(stringEncoder); For the record, it is much simpler in Scala:
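The Scala continuation the preview cuts off is presumably the implicit-encoder form; a minimal sketch, assuming a SparkSession named spark and a Dataset with a single string column:

  // With the implicits in scope, no explicit Encoder is needed.
  import spark.implicits._
  val strings = dataset.as[String]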

Re: spark ETL and spark thrift server running together

2022-03-30 Thread Christophe Préaud
Hi Alex, As stated in the Hive documentation (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration): *An embedded metastore database is mainly used for unit tests. Only one process can connect to the metastore database at a time, so it is not really a
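Since the embedded (Derby) metastore accepts only one connection, the usual fix is a shared remote metastore; a hedged sketch of how a Spark job might point at one (host, port and app name are placeholders):

  // Hypothetical sketch: connect Spark to a remote Hive metastore so several
  // processes (ETL jobs, the thrift server) can use it concurrently.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("etl")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()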

Re: How to convert RDD to DF for this case -

2017-02-17 Thread Christophe Préaud
Hi Aakash, You can try this:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StringType, StructField, StructType}
  val header = Array("col1", "col2", "col3", "col4")
  val schema = StructType(header.map(StructField(_, StringType, true)))
  val statRow = stat.map(line =>
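The preview is cut off at the row-building step; an assumed completion, under the guess that stat is an RDD[String] of comma-separated lines and that a SparkSession named spark is available:

  // Assumed continuation: split each line into fields, wrap it in a Row,
  // then build the DataFrame from the schema defined above.
  val statRow = stat.map(line => Row.fromSeq(line.split(",", -1).toSeq))
  val df = spark.createDataFrame(statRow, schema)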

Re: Partition n keys into exacly n partitions

2016-09-13 Thread Christophe Préaud
Hi, A custom partitioner is indeed the solution. Here is some sample code:

  import org.apache.spark.Partitioner
  class KeyPartitioner(keyList: Seq[Any]) extends Partitioner {
    def numPartitions: Int = keyList.size + 1
    def getPartition(key: Any): Int = keyList.indexOf(key) + 1
    override def
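The preview is truncated mid-override; a hedged completion (the equals/hashCode bodies are my guesses, and they assume the constructor argument is declared as a val so another instance's keyList is accessible):

    // Assumed completion of the truncated override above: make partitioners
    // with the same key list compare equal.
    override def equals(other: Any): Boolean = other match {
      case kp: KeyPartitioner => kp.keyList == keyList
      case _                  => false
    }
    override def hashCode: Int = keyList.hashCode
  }

  // Usage sketch: keys absent from keyList fall into partition 0.
  // rdd.partitionBy(new KeyPartitioner(keys))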

Re: Inode for STS

2016-07-13 Thread Christophe Préaud
Hi Ayan, I have opened a JIRA about this issue, but there is no answer so far: SPARK-15401 Regards, Christophe. On 13/07/16 05:54, ayan guha wrote: Hi We are running Spark Thrift Server as a long-running application. However, it looks

Re: SparkSQL with large result size

2016-05-10 Thread Christophe Préaud
Hi, You may be hitting this bug: SPARK-9879. In other words: did you try without the LIMIT clause? Regards, Christophe. On 02/05/16 20:02, Gourav Sengupta wrote: Hi, I have worked on 300GB of data by querying it from CSV (using Spark CSV) and

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Christophe Préaud
Hi, Unless I've misunderstood what you want to achieve, you could use: sqlContext.read.json(sc.textFile("/mnt/views-p/base/2016/01/*/*-xyz.json")) Regards, Christophe. On 09/03/16 15:24, Ted Yu wrote: Hadoop glob patterns don't support multi-level wildcards. Thanks On Mar 9, 2016, at 6:15 AM,

Re: local directories for spark running on yarn

2015-04-30 Thread Christophe Préaud
No, you should read: when running on YARN, spark.local.dir will be ignored (the local directories configured for YARN are used instead). This has been reworded (hopefully for the best) in 1.3.1: https://spark.apache.org/docs/1.3.1/running-on-yarn.html Christophe. On 17/04/2015 18:18, shenyanls wrote: According to the documentation: The

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-27 Thread Christophe Préaud
that the aggregated log files appear in a directory in hdfs under application/spark vs. application/yarn or similar. I will review my configurations and see if I can get this working. Thanks, Colin Williams On Thu, Feb 26, 2015 at 9:11 AM, Christophe Préaud christophe.pre

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-26 Thread Christophe Préaud
this configuration when I started, but didn't have much luck. Are you getting your spark apps run in yarn client or cluster mode in your yarn history server? If so can you share any spark settings? On Tue, Feb 24, 2015 at 8:48 AM, Christophe Préaud christophe.pre...@kelkoo.com

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Christophe Préaud
Hi Colin, Here is how I have configured my hadoop cluster to have yarn logs available through both the yarn CLI and the _yarn_ history server (with gzip compression and 10-day retention): 1. Add the following properties in the yarn-site.xml on each node manager and on the resource manager:
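A hedged reconstruction of the properties that list presumably goes on to give, using the standard YARN property names and the values described above (10-day retention, gzip):

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <!-- 10 days, in seconds -->
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>864000</value>
  </property>
  <property>
    <!-- gzip-compress the aggregated logs -->
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <value>gz</value>
  </property>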

Re: running spark project using java -cp command

2015-02-13 Thread Christophe Préaud
You can also export the variable SPARK_PRINT_LAUNCH_COMMAND before launching a spark-submit command to display the java command that will be launched, e.g.: export SPARK_PRINT_LAUNCH_COMMAND=1 /opt/spark/bin/spark-submit --master yarn --deploy-mode cluster --class kelkoo.SparkAppTemplate --jars

Re: How to set hadoop native library path in spark-1.1

2014-10-23 Thread Christophe Préaud
Hi, Try the --driver-library-path option of spark-submit, e.g.: /opt/spark/bin/spark-submit --driver-library-path /opt/hadoop/lib/native (...) Regards, Christophe. On 21/10/2014 20:44, Pradeep Ch wrote: Hi all, Can anyone tell me how to set the native library path in Spark. Right now I am

Re: Spark can't find jars

2014-10-16 Thread Christophe Préaud
Hi, I have created a JIRA (SPARK-3967: https://issues.apache.org/jira/browse/SPARK-3967), can you please confirm that you are hit by the same issue? Thanks, Christophe. On 15/10/2014 09:49, Christophe Préaud wrote: Hi Jimmy, Did you try my patch? The problem on my side

Re: Application failure in yarn-cluster mode

2014-10-16 Thread Christophe Préaud
(not hadoop.tmp.dir, as I said below) is set to a comma-separated list of directories which are located on different disks/partitions. Regards, Christophe. On 14/10/2014 09:37, Christophe Préaud wrote: Hi, Sorry to insist, but I really feel like the problem described below is a bug in Spark. Can

Re: Spark can't find jars

2014-10-15 Thread Christophe Préaud
...@sellpoints.com M: 510.303.7751 On Tue, Oct 14, 2014 at 4:59 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote: Hello, I have already posted a message with the exact same problem, and proposed a patch (the subject is Application failure in yarn-cluster mode). Can

Re: Application failure in yarn-cluster mode

2014-10-14 Thread Christophe Préaud
Hi, Sorry to insist, but I really feel like the problem described below is a bug in Spark. Can anybody confirm if it is a bug, or a (configuration?) problem on my side? Thanks, Christophe. On 10/10/2014 18:24, Christophe Préaud wrote: Hi, After updating from spark-1.0.0 to spark-1.1.0, my

Re: Spark can't find jars

2014-10-14 Thread Christophe Préaud
Hello, I have already posted a message with the exact same problem, and proposed a patch (the subject is "Application failure in yarn-cluster mode"). Can you test it, and see if it works for you? I would also be glad if someone could confirm that it is a bug in Spark 1.1.0. Regards, Christophe. On

Application failure in yarn-cluster mode

2014-10-10 Thread Christophe Préaud
Hi, After updating from spark-1.0.0 to spark-1.1.0, my spark applications failed most of the time (but not always) in yarn-cluster mode (but not in yarn-client mode). Here is my configuration: * spark-1.1.0 * hadoop-2.2.0 And the hadoop.tmp.dir definition in the hadoop core-site.xml
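The preview stops just before the core-site.xml extract; an illustrative sketch of the kind of definition the follow-up messages describe, i.e. a comma-separated list of directories on different disks (paths are placeholders):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/d1/hadoop/tmp,/d2/hadoop/tmp,/d3/hadoop/tmp</value>
  </property>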

Re: write event logs with YARN

2014-07-04 Thread Christophe Préaud
a little, but we will start looking into it. I'm assuming if you manually create your own APPLICATION_COMPLETE file then the entries should show up. Unfortunately I don't see another workaround for this, but we'll fix this as soon as possible. Andrew 2014-07-03 1:44 GMT-07:00 Christophe Préaud
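A sketch of the manual workaround suggested here, assuming the event logs live in a per-application directory on HDFS (the path is a placeholder):

  # Create the empty marker file by hand so the history server lists the app.
  hdfs dfs -touchz /user/spark/eventlogs/<application_directory>/APPLICATION_COMPLETE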

Re: write event logs with YARN

2014-07-03 Thread Christophe Préaud
Hi Andrew, This does not work (the application failed), I have the following error when I put 3 slashes in the hdfs scheme: (...) Caused by: java.lang.IllegalArgumentException: Pathname
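For reference, a sketch of the two URI spellings at issue, as they would appear in spark-defaults.conf (host, port and path are placeholders):

  # Three slashes: no authority, the default filesystem is implied.
  spark.eventLog.dir=hdfs:///user/spark/eventlogs
  # Explicit authority form.
  spark.eventLog.dir=hdfs://namenode:8020/user/spark/eventlogs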

Re: broadcast not working in yarn-cluster mode

2014-06-24 Thread Christophe Préaud
Hi again, I've finally solved the problem below: it was due to an old 1.0.0-rc3 Spark jar lying around in my .m2 directory, which was used when I compiled my Spark applications (with Maven). Christophe. On 20/06/2014 18:13, Christophe Préaud wrote: Hi, Since I migrated to spark 1.0.0

broadcast not working in yarn-cluster mode

2014-06-20 Thread Christophe Préaud
Hi, Since I migrated to spark 1.0.0, a couple of applications that used to work in 0.9.1 now fail when broadcasting a variable. Those applications are run on a YARN cluster in yarn-cluster mode (and used to run in yarn-standalone mode in 0.9.1) Here is an extract of the error log: Exception

write event logs with YARN

2014-06-19 Thread Christophe Préaud
Hi, I am trying to use the new Spark history server in 1.0.0 to view finished applications (launched on YARN), without success so far. Here are the relevant configuration properties in my spark-defaults.conf: spark.yarn.historyServer.address=server_name:18080 spark.ui.killEnabled=false
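The property list is cut off; the remaining settings such a setup typically needs are the event-log ones, sketched here with placeholder values (not necessarily the poster's actual configuration):

  spark.eventLog.enabled=true
  spark.eventLog.dir=hdfs://namenode:8020/user/spark/eventlogs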

Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-24 Thread Christophe Préaud
, 2014 at 9:27 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote: Hi, I am running Spark 0.9.1 on a YARN cluster, and I am wondering which is the correct way to add external jars when running a spark shell on a YARN cluster. Packaging all these dependencies

SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-16 Thread Christophe Préaud
Hi, I am running Spark 0.9.1 on a YARN cluster, and I am wondering which is the correct way to add external jars when running a spark shell on a YARN cluster. Packaging all these dependencies in an assembly whose path is then set in SPARK_YARN_APP_JAR (as written in the doc:
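A sketch of the three mechanisms the subject line refers to, as they were used around Spark 0.9 (paths are placeholders, and the exact semantics varied between releases):

  # Assembly jar handed to YARN for the shell's application master:
  export SPARK_YARN_APP_JAR=/path/to/my-assembly.jar
  # Extra jars on the driver/executor classpath:
  export SPARK_CLASSPATH=/path/to/dep.jar
  # Jars to make available in the shell session:
  export ADD_JARS=/path/to/dep.jar
  ./bin/spark-shell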

Re: How to create a RPM package

2014-04-04 Thread Christophe Préaud
Hi Rahul, Spark will be available in Fedora 21 (see: https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently scheduled for 2014-10-14, but they have already produced spec files and source RPMs. If you are stuck with EL6 like me, you can have a look at the attached spec file,