How to use disk instead of just InMemoryRelation when using a JDBC datasource in Spark SQL?

2018-04-11 Thread Louis Hust
We want to extract data from mysql and do the computation in Spark SQL. The SQL explain output looks like the snippet below. REGIONKEY#177,N_COMMENT#178] PushedFilters: [], ReadSchema: struct +- *(20) Sort [r_regionkey#203 ASC NULLS FIRST], false,
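
For reference, a minimal Java sketch of reading a table over JDBC and persisting it with a disk-backed storage level instead of the default in-memory cache; the MySQL URL, table name, and credentials are placeholders.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class JdbcDiskCache {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("jdbc-disk-cache").getOrCreate();

    Properties props = new Properties();
    props.put("user", "user");          // placeholder credentials
    props.put("password", "password");

    // Read the MySQL table over JDBC (URL and table name are placeholders).
    Dataset<Row> region = spark.read()
        .jdbc("jdbc:mysql://host:3306/tpch_100g", "region", props);

    // DISK_ONLY (or MEMORY_AND_DISK) keeps the cached blocks on disk instead of
    // relying purely on the in-memory cache that df.cache() uses.
    region.persist(StorageLevel.DISK_ONLY());

    region.createOrReplaceTempView("region");
    spark.sql("SELECT r_regionkey, r_name FROM region ORDER BY r_regionkey").show();

    spark.stop();
  }
}
```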

Re: Spark is only using one worker machine when more are available

2018-04-11 Thread 宋源栋
Hi, 1. Spark version: 2.3.0 2. JDK: Oracle JDK 1.8 3. OS version: CentOS 6.8 4. spark-env.sh: null 5. Spark session config: SparkSession.builder().appName("DBScale") .config("spark.sql.crossJoin.enabled", "true") .config("spark.sql.adaptive.enabled", "true")

NullPointerException when using repartition

2018-04-11 Thread Junfeng Chen
I wrote a program to read some json data from kafka, intending to save it as parquet files on hdfs. Here is my code: > JavaInputDstream stream = ... > JavaDstream rdd = stream.map... > rdd.repartition(taksNum).foreachRDD(VoldFunction stringjavardd->{ > Dataset df =
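
For comparison, a self-contained Java sketch of the same pattern (Kafka direct stream, repartition, write parquet per micro-batch); brokers, topic, partition count, and output path are placeholders, and empty batches are skipped.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaJsonToParquet {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("kafka-json-to-parquet");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "broker:9092");        // placeholder
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "parquet-writer");

    JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(
            Collections.singletonList("events"), kafkaParams)); // placeholder topic

    int taskNum = 8;                                            // placeholder partition count
    JavaDStream<String> json = stream.map(ConsumerRecord::value);

    json.repartition(taskNum).foreachRDD((JavaRDD<String> rdd) -> {
      if (rdd.isEmpty()) {
        return;                                                 // skip empty micro-batches
      }
      SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();
      Dataset<Row> df = spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
      df.write().mode(SaveMode.Append).parquet("hdfs:///data/events"); // placeholder path
    });

    jssc.start();
    jssc.awaitTermination();
  }
}
```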

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread surender kumar
Right, this is what I did when I said I tried to persist and create an RDD out of it to sample from. But how do I do this for each user? You have one RDD of users on one hand and an RDD of items on the other. How to go from here? Am I missing something trivial?  On Thursday, 12 April, 2018, 2:10:51

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
Here it is : https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2991198123660769/823198936734135/866038034322120/latest.html On Wed, Apr 11, 2018 at 10:55 AM, Alessandro Solimando < alessandro.solima...@gmail.com> wrote: > Hi Shiyuan, > can you show

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread Matteo Cossu
Why broadcast this list then? You should use an RDD or DataFrame. For example, RDD has a method sample() that returns a random sample from it. On 11 April 2018 at 22:34, surender kumar wrote: > I'm using pySpark. > I've a list of 1 million items (all float values
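
For reference, the same idea in the JVM API (the thread itself uses pySpark): keep the items in an RDD and call sample() instead of broadcasting the full list; the values, fraction, and seed below are made up.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class ItemSampling {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("item-sampling").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // Stand-in for the 1M-item list; in practice this would be loaded, not hard-coded.
    JavaRDD<Double> items = jsc.parallelize(Arrays.asList(0.1, 0.5, 0.9, 1.3, 2.7));

    // sample(withReplacement, fraction, seed) returns a random subset without
    // collecting the full list to the driver or broadcasting it.
    JavaRDD<Double> sampled = items.sample(false, 0.4, 42L);
    List<Double> small = sampled.collect();  // safe only because the sample is small
    System.out.println(small);

    spark.stop();
  }
}
```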

Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread surender kumar
I'm using pySpark. I have a list of 1 million items (all float values) and 1 million users. For each user I want to randomly sample some items from the item list. Broadcasting the item list results in an OutOfMemory error on the driver, even after raising driver memory to 10G. I tried to persist this

Re: Not able to access PySpark in Jupyter notebook

2018-04-11 Thread Dylan Guedes
Well... could you post the log or any errors that occur? I used this pyspark jupyter notebook and it worked great. On Wed, Apr 11, 2018 at 12:36 AM, @Nandan@ wrote: > Hi Users, > >

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Alessandro Solimando
Hi Shiyuan, can you show us the output of "explain" over df (as a last step)? On 11 April 2018 at 19:47, Shiyuan wrote: > Variable name binding is a Python thing, and Spark should not care how the > variable is named. What matters is the dependency graph. Spark fails to >

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
Variable name binding is a Python thing, and Spark should not care how the variable is named. What matters is the dependency graph. Spark fails to handle this dependency graph correctly, which quite surprises me: this is just a simple combination of three very common SQL operations. On Tue,

Re: cache OS memory and spark usage of it

2018-04-11 Thread yncxcw
Hi Raúl, (1)&(2) yes, the OS needs some pressure to release it. For example, if you have a total of 16GB RAM in your machine, and you read a file of 8GB and immediately close it, the page cache would now cache 8GB of the file data. Then you start a program requesting memory from the OS, and the OS will

Re: Spark is only using one worker machine when more are available

2018-04-11 Thread Jhon Anderson Cardenas Diaz
Hi, could you please share the environment variable values that you are sending when you run the jobs, the Spark version, etc. (more details). Btw, you should take a look at SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES if you are using spark 2.0.0
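
For context, a hedged sketch of the session-level knobs that usually govern how many cores a standalone application grabs; all values are placeholders (SPARK_WORKER_CORES / SPARK_WORKER_INSTANCES themselves live in spark-env.sh on the workers, not in application code).

```java
import org.apache.spark.sql.SparkSession;

public class ClusterWideSession {
  public static void main(String[] args) {
    // In standalone mode, spark.cores.max caps the total cores one application may
    // take across the cluster, and spark.executor.cores sets cores per executor.
    SparkSession spark = SparkSession.builder()
        .appName("DBScale")
        .config("spark.cores.max", "400")          // placeholder: 10 machines x 40 cores
        .config("spark.executor.cores", "8")       // placeholder
        .config("spark.executor.memory", "16g")    // placeholder
        .getOrCreate();

    // ... job code ...
    spark.stop();
  }
}
```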

Structured Streaming outputs a lot of small files in Append Mode

2018-04-11 Thread feng wang
Hi, I have seen the doc in Spark 2.2 about Structured Streaming > Append mode (default) - This is the default mode, where only the new rows > added to the Result Table since the last trigger will be outputted to the > sink. This is supported for only those queries where rows added to the > Result
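
One possible mitigation (not from the thread, just a sketch under assumed settings): widen the trigger interval and coalesce before the file sink so each micro-batch writes fewer, larger files; brokers, topic, paths, and the interval are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class FewerOutputFiles {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("fewer-output-files").getOrCreate();

    Dataset<Row> input = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")       // placeholder
        .option("subscribe", "events")                          // placeholder
        .load();

    // Fewer, larger files per trigger: a longer processing-time trigger means fewer
    // micro-batches, and coalesce() limits how many files each batch writes.
    StreamingQuery query = input.selectExpr("CAST(value AS STRING) AS value")
        .coalesce(4)
        .writeStream()
        .outputMode("append")
        .format("parquet")
        .option("path", "hdfs:///data/out")                     // placeholder
        .option("checkpointLocation", "hdfs:///checkpoints/out")// placeholder
        .trigger(Trigger.ProcessingTime("5 minutes"))
        .start();

    query.awaitTermination();
  }
}
```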

Spark is only using one worker machine when more are available

2018-04-11 Thread 宋源栋
Hi all, I have a standalone-mode Spark cluster without HDFS, with 10 machines, each with 40 CPU cores and 128G RAM. My application is a Spark SQL application that reads data from the database "tpch_100g" in mysql and runs TPC-H queries. When loading tables from mysql to Spark, I split the
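
One thing that usually helps spread such a load: the partitioned form of the JDBC reader, so each table is pulled as many range queries and downstream tasks land on many workers. A sketch with placeholder URL, credentials, bounds, and partition column:

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedJdbcLoad {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("tpch-load").getOrCreate();

    Properties props = new Properties();
    props.put("user", "user");            // placeholder
    props.put("password", "password");    // placeholder

    // jdbc(url, table, partitionColumn, lowerBound, upperBound, numPartitions, props)
    // splits the read into numPartitions range queries, so many executors pull data
    // and later stages are distributed instead of landing on a single worker.
    Dataset<Row> lineitem = spark.read().jdbc(
        "jdbc:mysql://host:3306/tpch_100g",   // placeholder URL
        "lineitem",
        "l_orderkey",                         // numeric split column (assumption)
        1L, 600000000L,                       // placeholder bounds
        400,                                  // placeholder partition count
        props);

    lineitem.createOrReplaceTempView("lineitem");
    spark.sql("SELECT COUNT(*) FROM lineitem").show();
    spark.stop();
  }
}
```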

How to filter the datetime in a dataset with java code please?

2018-04-11 Thread 1427357...@qq.com
Hi all, I want to filter the data by the datetime. In mysql, the column is of the DATETIME type, named A. I wrote my code like: import java.util.Date; newX.filter(newX.col("A").isNull().or(newX.col("A").lt(new Date(.show(); I got error: Exception in thread "main" java.lang.RuntimeException:
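
A sketch of the same filter using the built-in SQL functions instead of java.util.Date (the column name A comes from the thread; the cutoff value is a placeholder):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.to_timestamp;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class DatetimeFilter {
  // Passing a java.util.Date into lt() is not supported; compare against a
  // timestamp built from a string literal instead.
  static Dataset<Row> filterByDatetime(Dataset<Row> newX) {
    return newX.filter(
        col("A").isNull()
            .or(col("A").lt(to_timestamp(lit("2018-04-11 00:00:00")))));  // placeholder cutoff
  }
}
```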

Re: Re: how to use the sql join in java please

2018-04-11 Thread 1427357...@qq.com
Hi yucai, It works well now. Thanks. 1427357...@qq.com From: Yu, Yucai Date: 2018-04-11 16:01 To: 1427357...@qq.com; spark users Subject: Re: how to use the sql join in java please Do you really want to do a cartesian product on those two tables? If yes, you can set

Re: How to submit some code segment to existing SparkContext

2018-04-11 Thread Saisai Shao
Maybe you can try Livy (http://livy.incubator.apache.org/). Thanks Jerry 2018-04-11 15:46 GMT+08:00 杜斌 : > Hi, > > Is there any way to submit some code segment to the existing SparkContext? > Just like a web backend, send some user code to the Spark to run, but the > initial
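
For illustration, a rough Java sketch of pushing a statement to an already-created Livy session over its REST API; the host, port, session id, and code snippet are placeholders, and error handling is omitted.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class LivySubmit {
  public static void main(String[] args) throws Exception {
    // Assumes a Livy session (id 0 here, placeholder) was already created with
    // POST /sessions {"kind": "spark"}; statements then run inside that long-lived
    // SparkContext, so each snippet avoids SparkContext startup cost.
    URL url = new URL("http://livy-host:8998/sessions/0/statements");  // placeholder host/id
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);

    String body = "{\"code\": \"spark.sql(\\\"SELECT 1\\\").show()\"}";  // placeholder snippet
    try (OutputStream os = conn.getOutputStream()) {
      os.write(body.getBytes(StandardCharsets.UTF_8));
    }

    try (Scanner sc = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
      System.out.println(sc.useDelimiter("\\A").next());  // raw JSON with the statement id/state
    }
  }
}
```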

Re: how to use the sql join in java please

2018-04-11 Thread Yu, Yucai
Do you really want to do a cartesian product on those two tables? If yes, you can set spark.sql.crossJoin.enabled=true. Thanks, Yucai From: "1427357...@qq.com" <1427357...@qq.com> Date: Wednesday, April 11, 2018 at 3:16 PM To: spark users Subject: how to use the sql join
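
A short sketch of both routes (dataset names are placeholders): flip the config, or call crossJoin explicitly so the cartesian product is visible in the code.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CrossJoinExample {
  static Dataset<Row> cartesian(SparkSession spark, Dataset<Row> left, Dataset<Row> right) {
    // Option 1: allow implicit cartesian products for joins without a condition.
    spark.conf().set("spark.sql.crossJoin.enabled", "true");

    // Option 2 (usually clearer): ask for the cartesian product explicitly.
    return left.crossJoin(right);
  }
}
```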

How to submit some code segment to existing SparkContext

2018-04-11 Thread 杜斌
Hi, Is there any way to submit a code segment to an existing SparkContext? Just like a web backend: send some user code to Spark to run. The initial SparkContext takes time to create, and I just want to execute some code or Spark SQL and get the result quickly. Thanks, Bin

Re: Issue with map function in Spark 2.2.0

2018-04-11 Thread ayan guha
As the error says clearly, column FL Date has a different format than you are expecting. Modify your date format mask appropriately. On Wed, 11 Apr 2018 at 5:12 pm, @Nandan@ wrote: > Hi , > I am not able to use .map function in Spark. > > My codes are as below :- >

how to use the sql join in java please

2018-04-11 Thread 1427357...@qq.com
Hi all, I wrote Java code to join two tables. My code looks like: SparkSession ss = SparkSession.builder().master("local[4]").appName("testSql").getOrCreate(); Properties properties = new Properties(); properties.put("user","A"); properties.put("password","B");
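
If the goal is a keyed join rather than a cartesian product, a sketch along these lines may help; the JDBC URLs, table names, and the join column "id" are assumptions.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcJoin {
  public static void main(String[] args) {
    SparkSession ss = SparkSession.builder()
        .master("local[4]").appName("testSql").getOrCreate();

    Properties properties = new Properties();
    properties.put("user", "A");
    properties.put("password", "B");

    // Placeholder URLs and table names.
    Dataset<Row> t1 = ss.read().jdbc("jdbc:mysql://host:3306/db", "table1", properties);
    Dataset<Row> t2 = ss.read().jdbc("jdbc:mysql://host:3306/db", "table2", properties);

    // An explicit join condition avoids the cartesian-product error entirely.
    Dataset<Row> joined = t1.join(t2, t1.col("id").equalTo(t2.col("id")), "inner");
    joined.show();

    ss.stop();
  }
}
```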

Issue with map function in Spark 2.2.0

2018-04-11 Thread @Nandan@
Hi, I am not able to use the .map function in Spark. My code is as below: *1) Create Parse function:-* from datetime import datetime from collections import namedtuple fields = ('date','airline','flightnum','origin','dest','dep','dep_delay','arv','arv_delay','airtime','distance') Flight =

Re: cache OS memory and spark usage of it

2018-04-11 Thread Jose Raul Perez Rodriguez
It was helpful. So the OS needs to feel some pressure from applications requesting memory before it frees some of the memory cache? Exactly under which circumstances does the OS free that memory to give it to applications requesting it? I mean, if the total memory is 16GB and 10GB are used for OS cache,

How do I implement forEachWriter in structured streaming so that the connection is created once per partition?

2018-04-11 Thread SRK
Hi, How do I implement ForeachWriter in structured streaming so that the connection is created once per partition and updates are done in a batch, just like foreachPartition in RDDs? Thanks for the help! -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
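
The usual pattern, sketched in Java with placeholder JDBC details: ForeachWriter.open() runs once per partition per trigger, so the connection is created there; rows are added to a batch in process(); the batch is flushed in close().

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;

public class JdbcForeachWriter extends ForeachWriter<Row> {
  private transient Connection connection;
  private transient PreparedStatement statement;

  @Override
  public boolean open(long partitionId, long version) {
    try {
      // Called once per partition per trigger: create the connection here, not per row.
      connection = DriverManager.getConnection(
          "jdbc:mysql://host:3306/db", "user", "password");     // placeholders
      connection.setAutoCommit(false);
      statement = connection.prepareStatement(
          "INSERT INTO sink_table (k, v) VALUES (?, ?)");       // placeholder SQL
      return true;                                              // false would skip this partition
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void process(Row row) {
    try {
      statement.setString(1, row.getString(0));
      statement.setString(2, row.getString(1));
      statement.addBatch();                                     // batched, flushed in close()
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void close(Throwable errorOrNull) {
    try {
      if (errorOrNull == null && statement != null) {
        statement.executeBatch();
        connection.commit();
      }
      if (connection != null) {
        connection.close();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```

It would typically be wired in with something like streamingDf.writeStream().foreach(new JdbcForeachWriter()).start().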

Does structured streaming support Spark Kafka Direct?

2018-04-11 Thread SRK
Hi, We have code based on Spark Kafka Direct in production and we want to port this code to Structured Streaming. Does Structured Streaming support Spark Kafka Direct? What are the configs for parallelism and scalability in Structured Streaming? In Spark Kafka Direct, the number of kafka
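
Structured Streaming has its own Kafka source rather than reusing the direct stream; a sketch with placeholder brokers, topic, and paths follows. The source reads one Spark partition per Kafka topic partition by default, and spark.sql.shuffle.partitions controls the parallelism of downstream aggregations.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredKafka {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("structured-kafka")
        // Controls shuffle parallelism for streaming aggregations; placeholder value.
        .config("spark.sql.shuffle.partitions", "200")
        .getOrCreate();

    // The "kafka" source is the Structured Streaming counterpart of the direct stream;
    // it tracks offsets in the checkpoint and reads Kafka partitions in parallel.
    Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
        .option("subscribe", "events")                       // placeholder
        .option("startingOffsets", "latest")
        .load();

    StreamingQuery query = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream()
        .format("console")
        .option("checkpointLocation", "/tmp/chk-structured-kafka")  // placeholder
        .outputMode("append")
        .start();

    query.awaitTermination();
  }
}
```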