Re: spark.speculation setting support on standalone mode?

2017-02-27 Thread Saisai Shao
I think it should be. These configurations don't depend on which cluster manager the user chooses. On Tue, Feb 28, 2017 at 4:42 AM, satishl wrote: > Are spark.speculation and related settings supported on standalone mode?
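
For anyone searching the archive, a minimal sketch of enabling it (the property names are the standard ones from the Spark configuration page; the values shown are the documented defaults, included only as an illustration):

    import org.apache.spark.SparkConf

    // Speculation is handled by the task scheduler, not the cluster manager,
    // so the same settings apply under standalone, YARN, or Mesos.
    val conf = new SparkConf()
      .set("spark.speculation", "true")
      .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
      .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow
      .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking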

Run spark machine learning example on Yarn failed

2017-02-27 Thread Yunjie Ji
After starting dfs, yarn, and spark, I ran this command from the root directory of Spark on my master host: `MASTER=yarn ./bin/run-example ml.LogisticRegressionExample data/mllib/sample_libsvm_data.txt` I actually took this command from Spark's README. And here is the source code about
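
For anyone reproducing this: run-example is a thin wrapper around spark-submit, so an equivalent invocation would look roughly like the following (the jar name assumes a Spark 2.1.0 / Scala 2.11 binary build):

    ./bin/spark-submit \
      --master yarn \
      --deploy-mode client \
      --class org.apache.spark.examples.ml.LogisticRegressionExample \
      examples/jars/spark-examples_2.11-2.1.0.jar \
      data/mllib/sample_libsvm_data.txt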

Error while enabling Hive Support in Spark 2.1

2017-02-27 Thread SRK
Hi, I have been trying to get my Spark job upgraded to 2.x, and I see the following error. It seems to be looking for some global_temp database by default. Is it the behaviour of Spark 2.x to look for a global_temp database by default?

17/02/27 16:59:09 INFO HiveMetaStore.audit: ugi=user1234
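
For what it's worth, Spark 2.x reserves a database name (global_temp by default) for global temporary views, so a metastore lookup for it during session start-up is expected. A minimal sketch of building a Hive-enabled session in 2.x (the app name is illustrative):

    import org.apache.spark.sql.SparkSession

    // In 2.x, Hive support is enabled on the session builder;
    // the old HiveContext is deprecated.
    val spark = SparkSession.builder()
      .appName("hive-support-check")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()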

using spark to load a data warehouse in real time

2017-02-27 Thread Adaryl Wakefield
Is anybody using Spark Streaming/SQL to load a relational data warehouse in real time? There isn't a lot of information on this use case out there. When I google "real-time data warehouse load", nothing I find is up to date. It's all turn-of-the-century material that doesn't take into account

[Spark Kafka] API Doc pages for Kafka 0.10 not current

2017-02-27 Thread Afshartous, Nick
Hello, Looks like the API docs linked from the Spark Kafka 0.10 Integration page are not current. For instance, on the page https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html the code examples show the new API (i.e. class ConsumerStrategies). However,
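
For readers landing here, the shape of the new 0.10 API that the page's examples use is as follows (a sketch assuming an existing StreamingContext named ssc and a broker at localhost:9092):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // ConsumerStrategies replaces the per-method overloads of the 0.8 API
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Array("topicA"), kafkaParams)
    )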

spark.speculation setting support on standalone mode?

2017-02-27 Thread satishl
Are spark.speculation and related settings supported on standalone mode?

Re: Spark - YARN Cluster Mode

2017-02-27 Thread ayan guha
Hi, thanks a lot. I used a properties file to resolve the issue; I think the documentation should mention it, though. On Tue, 28 Feb 2017 at 5:05 am, Marcelo Vanzin wrote: > > none of my Config settings > > Is it none of the configs or just the queue? You can't set the YARN > queue

Re: How to set hive configs in Spark 2.1?

2017-02-27 Thread swetha kasireddy
Would even the hive configurations like the following work with this?

sqlContext.setConf("hive.default.fileformat", "Orc")
sqlContext.setConf("hive.exec.orc.memory.pool", "1.0")
sqlContext.setConf("hive.optimize.sort.dynamic.partition", "true")

Re: Spark runs out of memory with small file

2017-02-27 Thread Henry Tremblay
Thanks! That works:

from pyspark.sql import Row

def process_file(my_iter):
    the_id = "init"
    final = []
    for chunk in my_iter:
        lines = chunk[1].split("\n")
        for line in lines:
            if line[0:15] == 'WARC-Record-ID:':
                the_id = line[15:]
            # keep the current record id attached to every line
            final.append(Row(the_id=the_id, line=line))
    return final

[Spark 2.1.0 ML] Serializing/Deserializing LocalLDA Problem

2017-02-27 Thread Benjamin Edwards
I am hoping someone can confirm this is a bug and/or provide a solution. I am trying to serialize an LDA model to disk for later use, but upon deserialization the model is not fully functional. In particular, transformation of data throws a NullPointerException. Here is a minimal example (just run
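
For context, the round trip being described looks roughly like this (a sketch assuming the spark.ml clustering API, a DataFrame named training with a features column, and a hypothetical save path):

    import org.apache.spark.ml.clustering.{LDA, LocalLDAModel}

    val model = new LDA().setK(10).setMaxIter(10).fit(training)
    model.write.overwrite().save("/tmp/lda-model")

    // transform() on the reloaded model is where the reported
    // NullPointerException appears
    val restored = LocalLDAModel.load("/tmp/lda-model")
    restored.transform(training).show()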

Re: Spark - YARN Cluster Mode

2017-02-27 Thread Marcelo Vanzin
> none of my Config settings Is it none of the configs or just the queue? You can't set the YARN queue in cluster mode through code; it has to be set on the command line. It's a chicken-and-egg problem (in cluster mode, the YARN app is created before your code runs). --properties-file works the
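
Concretely, that means putting the queue on the command line, or in a properties file passed there (the queue and file names below are illustrative):

    # set the queue directly on submission:
    ./bin/spark-submit --master yarn --deploy-mode cluster --queue myqueue ...

    # or keep it in a properties file:
    ./bin/spark-submit --properties-file my-job.conf ...
    # where my-job.conf contains a line such as:
    #   spark.yarn.queue  myqueue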

Re: How to set hive configs in Spark 2.1?

2017-02-27 Thread neil90
All you need to do is:

spark.conf.set("spark.sql.shuffle.partitions", 2000)
spark.conf.set("spark.sql.orc.filterPushdown", True)
...etc

Re: Is there a list of missing optimizations for typed functions?

2017-02-27 Thread lihu
Hi, you can refer to https://issues.apache.org/jira/browse/SPARK-14083 for more detail. For performance, it is better to use the DataFrame API than the Dataset API. On Sat, Feb 25, 2017 at 2:45 AM, Jacek Laskowski wrote: > Hi Justin, > > I have never seen such a list. I think
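
A small illustration of the gap SPARK-14083 describes (a Scala sketch; the Person class and input path are hypothetical):

    import spark.implicits._   // for the $"..." column syntax

    case class Person(name: String, age: Long)
    val ds = spark.read.parquet("/data/people").as[Person]

    // Typed filter: an arbitrary JVM closure, so Catalyst cannot look inside
    // it, and optimizations such as predicate pushdown are lost.
    ds.filter(person => person.age > 21)

    // Untyped filter: a Column expression Catalyst can analyze, optimize,
    // and push down to the data source.
    ds.filter($"age" > 21)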

How to set hive configs in Spark 2.1?

2017-02-27 Thread SRK
Hi, how do I set the hive configurations in Spark 2.1? I have the following in 1.6. How do I set these hive-related configs using the new SparkSession?

sqlContext.sql(s"use ${HIVE_DB_NAME} ")
sqlContext.setConf("hive.exec.dynamic.partition", "true")
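
Matching the replies elsewhere in this thread, the 1.6-style calls translate to the unified SparkSession in 2.1 roughly as follows (a sketch; HIVE_DB_NAME is the same variable as in the snippet above):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .enableHiveSupport()   // required for Hive-specific settings to take effect
      .getOrCreate()

    spark.sql(s"use ${HIVE_DB_NAME}")
    spark.conf.set("hive.exec.dynamic.partition", "true")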

Re: Structured Streaming: How to handle bad input

2017-02-27 Thread sasubillis
I think it is the user's responsibility to validate the input before feeding it in. https://databricks.gitbooks.io/databricks-spark-knowledge-base/best_practices/dealing_with_bad_data.html

Re: Get S3 Parquet File

2017-02-27 Thread Femi Anthony
Ok, thanks a lot for the heads up. Sent from my iPhone > On Feb 25, 2017, at 10:58 AM, Steve Loughran wrote: > > >> On 24 Feb 2017, at 07:47, Femi Anthony wrote: >> >> Have you tried reading using s3n, which is a slightly older protocol? I'm >>

handling dependency conflicts with spark

2017-02-27 Thread Mendelson, Assaf
Hi, I have a project which uses Jackson 2.8.5; Spark, on the other hand, seems to be using 2.6.5. I am using maven to compile. My original solution to the problem has been to set the spark dependencies with the "provided" scope and use the maven shade plugin to shade Jackson in my compilation. The
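
To make the shading concrete, a pom.xml sketch of the relocation described (the plugin version and shaded prefix are illustrative, not prescriptive):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <!-- rewrite my Jackson 2.8.5 classes into a private namespace so
                   they cannot clash with the 2.6.5 classes on Spark's classpath -->
              <relocation>
                <pattern>com.fasterxml.jackson</pattern>
                <shadedPattern>shaded.com.fasterxml.jackson</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>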

Re: Spark 2.1.0 issue with spark-shell and pyspark

2017-02-27 Thread romain.jouin
Hi,

master = "spark://193.70.43.207:7077"
appName = "romain2"
spark = SparkSession.builder.master(master).appName(appName).getOrCreate()

also gives me an error: IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':" Any way out?

Re: attempting to map Dataset[Row]

2017-02-27 Thread Yan Facai
Hi, Fletcher. A case class can help construct a complex structure, and RDD, StructType, and StructField are also helpful if you need them. However, the code is a little confusing:

source.map { row => {
    val key = row(0)
    val buff = new ArrayBuffer[Row]()
    buff += row
    (key, buff)
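
If the goal of that snippet is to gather rows under the value of their first column, a less confusing sketch (assuming the key is a string and spark.implicits._ is in scope) might let the engine do the grouping:

    // group rows by the first column instead of hand-building
    // (key, ArrayBuffer[Row]) pairs inside map
    val grouped = source
      .groupByKey(row => row.getString(0))
      .mapGroups { (key, rows) => (key, rows.length) }   // e.g. count per key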

Re: Spark runs out of memory with small file

2017-02-27 Thread Henry Tremblay
This won't work:

rdd2 = rdd.flatMap(splitf)
rdd2.take(1)
[u'WARC/1.0\r']
rdd2.count()
508310

If I then try to apply a map to rdd2, the map only works on each individual line. I need to create a state machine as in my second function. That is, I need to apply a key to each line, but the