How to use disk instead of just InMemoryRelation when using a JDBC datasource in Spark SQL?

2018-04-11 Thread Louis Hust
We want to extract data from mysql and do the computation in Spark SQL. The SQL explain output looks like the snippet below. REGIONKEY#177,N_COMMENT#178] PushedFilters: [], ReadSchema: struct +- *(20) Sort [r_regionkey#203 ASC NULLS FIRST], false,
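
For reference, a minimal Java sketch of reading a table over JDBC and persisting it with a disk-backed storage level instead of the default in-memory cache; the MySQL URL, table name, and credentials are placeholders.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class JdbcDiskCache {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("jdbc-disk-cache").getOrCreate();

    Properties props = new Properties();
    props.put("user", "user");          // placeholder credentials
    props.put("password", "password");

    // Read the MySQL table over JDBC (URL and table name are placeholders).
    Dataset<Row> region = spark.read()
        .jdbc("jdbc:mysql://host:3306/tpch_100g", "region", props);

    // DISK_ONLY (or MEMORY_AND_DISK) keeps the cached blocks on disk instead of
    // relying purely on the in-memory cache that df.cache() uses.
    region.persist(StorageLevel.DISK_ONLY());

    region.createOrReplaceTempView("region");
    spark.sql("SELECT r_regionkey, r_name FROM region ORDER BY r_regionkey").show();

    spark.stop();
  }
}
```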

Re: Spark is only using one worker machine when more are available

2018-04-11 Thread 宋源栋
Hi, 1. Spark version: 2.3.0 2. JDK: Oracle JDK 1.8 3. OS version: CentOS 6.8 4. spark-env.sh: null 5. Spark session config: SparkSession.builder().appName("DBScale") .config("spark.sql.crossJoin.enabled", "true") .config("spark.sql.adaptive.enabled", "true")

NullPointerException when using repartition

2018-04-11 Thread Junfeng Chen
I wrote a program to read some json data from kafka, intending to save it as parquet files on hdfs. Here is my code: > JavaInputDstream stream = ... > JavaDstream rdd = stream.map... > rdd.repartition(taksNum).foreachRDD(VoldFunction stringjavardd->{ > Dataset df =
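
For comparison, a self-contained Java sketch of the same pattern (Kafka direct stream, repartition, write parquet per micro-batch); brokers, topic, partition count, and output path are placeholders, and empty batches are skipped.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaJsonToParquet {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("kafka-json-to-parquet");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "broker:9092");        // placeholder
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "parquet-writer");

    JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(
            Collections.singletonList("events"), kafkaParams)); // placeholder topic

    int taskNum = 8;                                            // placeholder partition count
    JavaDStream<String> json = stream.map(ConsumerRecord::value);

    json.repartition(taskNum).foreachRDD((JavaRDD<String> rdd) -> {
      if (rdd.isEmpty()) {
        return;                                                 // skip empty micro-batches
      }
      SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();
      Dataset<Row> df = spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
      df.write().mode(SaveMode.Append).parquet("hdfs:///data/events"); // placeholder path
    });

    jssc.start();
    jssc.awaitTermination();
  }
}
```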

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread surender kumar
Right, this is what I did when I said I tried to persist and create an RDD out of it to sample from. But how do I do this for each user? You have one RDD of users on one hand and an RDD of items on the other. How to go from here? Am I missing something trivial?  On Thursday, 12 April, 2018, 2:10:51

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
Here it is : https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2991198123660769/823198936734135/866038034322120/latest.html On Wed, Apr 11, 2018 at 10:55 AM, Alessandro Solimando < alessandro.solima...@gmail.com> wrote: > Hi Shiyuan, > can you show

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread Matteo Cossu
Why broadcast this list then? You should use an RDD or DataFrame. For example, RDD has a method sample() that returns a random sample from it. On 11 April 2018 at 22:34, surender kumar wrote: > I'm using pySpark. > I've a list of 1 million items (all float values
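
For reference, the same idea in the JVM API (the thread itself uses pySpark): keep the items in an RDD and call sample() instead of broadcasting the full list; the values, fraction, and seed below are made up.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class ItemSampling {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("item-sampling").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // Stand-in for the 1M-item list; in practice this would be loaded, not hard-coded.
    JavaRDD<Double> items = jsc.parallelize(Arrays.asList(0.1, 0.5, 0.9, 1.3, 2.7));

    // sample(withReplacement, fraction, seed) returns a random subset without
    // collecting the full list to the driver or broadcasting it.
    JavaRDD<Double> sampled = items.sample(false, 0.4, 42L);
    List<Double> small = sampled.collect();  // safe only because the sample is small
    System.out.println(small);

    spark.stop();
  }
}
```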

Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread surender kumar
I'm using pySpark. I have a list of 1 million items (all float values) and 1 million users. For each user I want to randomly sample some items from the item list. Broadcasting the item list results in an OutOfMemory error on the driver, even after raising driver memory to 10G. I tried to persist this

Re: Not able to access PySpark in Jupyter notebook

2018-04-11 Thread Dylan Guedes
Well... could you post the log or any errors that occur? I used this pyspark jupyter notebook and it worked great. On Wed, Apr 11, 2018 at 12:36 AM, @Nandan@ wrote: > Hi Users, > >

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Alessandro Solimando
Hi Shiyuan, can you show us the output of "explain" over df (as a last step)? On 11 April 2018 at 19:47, Shiyuan wrote: > Variable name binding is a Python thing, and Spark should not care how the > variable is named. What matters is the dependency graph. Spark fails to >

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
Variable name binding is a Python thing, and Spark should not care how the variable is named. What matters is the dependency graph. Spark fails to handle this dependency graph correctly, which quite surprises me: this is just a simple combination of three very common SQL operations. On Tue,

Re: cache OS memory and spark usage of it

2018-04-11 Thread yncxcw
Hi Raúl, (1)&(2) yes, the OS needs some pressure to release it. For example, if you have a total of 16GB RAM in your machine, and you read a file of 8GB and immediately close it, the page cache would now cache 8GB of the file data. Then you start a program requesting memory from the OS, and the OS will

Re: Spark is only using one worker machine when more are available

2018-04-11 Thread Jhon Anderson Cardenas Diaz
Hi, could you please share the environment variable values that you are sending when you run the jobs, the Spark version, etc. (more details). Btw, you should take a look at SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES if you are using spark 2.0.0
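
For context, a hedged sketch of the session-level knobs that usually govern how many cores a standalone application grabs; all values are placeholders (SPARK_WORKER_CORES / SPARK_WORKER_INSTANCES themselves live in spark-env.sh on the workers, not in application code).

```java
import org.apache.spark.sql.SparkSession;

public class ClusterWideSession {
  public static void main(String[] args) {
    // In standalone mode, spark.cores.max caps the total cores one application may
    // take across the cluster, and spark.executor.cores sets cores per executor.
    SparkSession spark = SparkSession.builder()
        .appName("DBScale")
        .config("spark.cores.max", "400")          // placeholder: 10 machines x 40 cores
        .config("spark.executor.cores", "8")       // placeholder
        .config("spark.executor.memory", "16g")    // placeholder
        .getOrCreate();

    // ... job code ...
    spark.stop();
  }
}
```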

Structured Streaming outputs a lot of small files in Append Mode

2018-04-11 Thread feng wang
Hi, I have seen the doc in Spark 2.2 about Structured Streaming > Append mode (default) - This is the default mode, where only the new rows > added to the Result Table since the last trigger will be outputted to the > sink. This is supported for only those queries where rows added to the > Result
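
One possible mitigation (not from the thread, just a sketch under assumed settings): widen the trigger interval and coalesce before the file sink so each micro-batch writes fewer, larger files; brokers, topic, paths, and the interval are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class FewerOutputFiles {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("fewer-output-files").getOrCreate();

    Dataset<Row> input = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")       // placeholder
        .option("subscribe", "events")                          // placeholder
        .load();

    // Fewer, larger files per trigger: a longer processing-time trigger means fewer
    // micro-batches, and coalesce() limits how many files each batch writes.
    StreamingQuery query = input.selectExpr("CAST(value AS STRING) AS value")
        .coalesce(4)
        .writeStream()
        .outputMode("append")
        .format("parquet")
        .option("path", "hdfs:///data/out")                     // placeholder
        .option("checkpointLocation", "hdfs:///checkpoints/out")// placeholder
        .trigger(Trigger.ProcessingTime("5 minutes"))
        .start();

    query.awaitTermination();
  }
}
```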

Spark is only using one worker machine when more are available

2018-04-11 Thread 宋源栋
Hi all, I have a standalone-mode Spark cluster without HDFS, with 10 machines, each with 40 CPU cores and 128G RAM. My application is a Spark SQL application that reads data from the database "tpch_100g" in mysql and runs TPC-H queries. When loading tables from mysql to Spark, I split the
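
One thing that usually helps spread such a load: the partitioned form of the JDBC reader, so each table is pulled as many range queries and downstream tasks land on many workers. A sketch with placeholder URL, credentials, bounds, and partition column:

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedJdbcLoad {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("tpch-load").getOrCreate();

    Properties props = new Properties();
    props.put("user", "user");            // placeholder
    props.put("password", "password");    // placeholder

    // jdbc(url, table, partitionColumn, lowerBound, upperBound, numPartitions, props)
    // splits the read into numPartitions range queries, so many executors pull data
    // and later stages are distributed instead of landing on a single worker.
    Dataset<Row> lineitem = spark.read().jdbc(
        "jdbc:mysql://host:3306/tpch_100g",   // placeholder URL
        "lineitem",
        "l_orderkey",                         // numeric split column (assumption)
        1L, 600000000L,                       // placeholder bounds
        400,                                  // placeholder partition count
        props);

    lineitem.createOrReplaceTempView("lineitem");
    spark.sql("SELECT COUNT(*) FROM lineitem").show();
    spark.stop();
  }
}
```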

How to filter the datetime in a dataset with java code please?

2018-04-11 Thread 1427357...@qq.com
Hi all, I want to filter the data by the datetime. In mysql, the column is of the DATETIME type, named A. I wrote my code like: import java.util.Date; newX.filter(newX.col("A").isNull().or(newX.col("A").lt(new Date(.show(); I got error: Exception in thread "main" java.lang.RuntimeException:
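
A sketch of the same filter using the built-in SQL functions instead of java.util.Date (the column name A comes from the thread; the cutoff value is a placeholder):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.to_timestamp;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class DatetimeFilter {
  // Passing a java.util.Date into lt() is not supported; compare against a
  // timestamp built from a string literal instead.
  static Dataset<Row> filterByDatetime(Dataset<Row> newX) {
    return newX.filter(
        col("A").isNull()
            .or(col("A").lt(to_timestamp(lit("2018-04-11 00:00:00")))));  // placeholder cutoff
  }
}
```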

Re: Re: how to use the sql join in java please

2018-04-11 Thread 1427357...@qq.com
Hi yucai, It works well now. Thanks. 1427357...@qq.com From: Yu, Yucai Date: 2018-04-11 16:01 To: 1427357...@qq.com; spark users Subject: Re: how to use the sql join in java please Do you really want to do a cartesian product on those two tables? If yes, you can set

Re: How to submit some code segment to existing SparkContext

2018-04-11 Thread Saisai Shao
Maybe you can try Livy (http://livy.incubator.apache.org/). Thanks Jerry 2018-04-11 15:46 GMT+08:00 杜斌 : > Hi, > > Is there any way to submit some code segment to the existing SparkContext? > Just like a web backend, send some user code to the Spark to run, but the > initial
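
For illustration, a rough Java sketch of pushing a statement to an already-created Livy session over its REST API; the host, port, session id, and code snippet are placeholders, and error handling is omitted.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class LivySubmit {
  public static void main(String[] args) throws Exception {
    // Assumes a Livy session (id 0 here, placeholder) was already created with
    // POST /sessions {"kind": "spark"}; statements then run inside that long-lived
    // SparkContext, so each snippet avoids SparkContext startup cost.
    URL url = new URL("http://livy-host:8998/sessions/0/statements");  // placeholder host/id
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);

    String body = "{\"code\": \"spark.sql(\\\"SELECT 1\\\").show()\"}";  // placeholder snippet
    try (OutputStream os = conn.getOutputStream()) {
      os.write(body.getBytes(StandardCharsets.UTF_8));
    }

    try (Scanner sc = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
      System.out.println(sc.useDelimiter("\\A").next());  // raw JSON with the statement id/state
    }
  }
}
```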

Re: how to use the sql join in java please

2018-04-11 Thread Yu, Yucai
Do you really want to do a cartesian product on those two tables? If yes, you can set spark.sql.crossJoin.enabled=true. Thanks, Yucai From: "1427357...@qq.com" <1427357...@qq.com> Date: Wednesday, April 11, 2018 at 3:16 PM To: spark users Subject: how to use the sql join
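
A short sketch of both routes (dataset names are placeholders): flip the config, or call crossJoin explicitly so the cartesian product is visible in the code.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CrossJoinExample {
  static Dataset<Row> cartesian(SparkSession spark, Dataset<Row> left, Dataset<Row> right) {
    // Option 1: allow implicit cartesian products for joins without a condition.
    spark.conf().set("spark.sql.crossJoin.enabled", "true");

    // Option 2 (usually clearer): ask for the cartesian product explicitly.
    return left.crossJoin(right);
  }
}
```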

How to submit some code segment to existing SparkContext

2018-04-11 Thread 杜斌
Hi, Is there any way to submit a code segment to an existing SparkContext? Just like a web backend: send some user code to Spark to run. The initial SparkContext takes time to create, and I just want to execute some code or Spark SQL and get the result quickly. Thanks, Bin

Re: Issue with map function in Spark 2.2.0

2018-04-11 Thread ayan guha
As the error says clearly, column FL Date has a different format than you are expecting. Modify your date format mask appropriately. On Wed, 11 Apr 2018 at 5:12 pm, @Nandan@ wrote: > Hi , > I am not able to use .map function in Spark. > > My codes are as below :- >

how to use the sql join in java please

2018-04-11 Thread 1427357...@qq.com
Hi all, I wrote Java code to join two tables. My code looks like: SparkSession ss = SparkSession.builder().master("local[4]").appName("testSql").getOrCreate(); Properties properties = new Properties(); properties.put("user","A"); properties.put("password","B");
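
If the goal is a keyed join rather than a cartesian product, a sketch along these lines may help; the JDBC URLs, table names, and the join column "id" are assumptions.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcJoin {
  public static void main(String[] args) {
    SparkSession ss = SparkSession.builder()
        .master("local[4]").appName("testSql").getOrCreate();

    Properties properties = new Properties();
    properties.put("user", "A");
    properties.put("password", "B");

    // Placeholder URLs and table names.
    Dataset<Row> t1 = ss.read().jdbc("jdbc:mysql://host:3306/db", "table1", properties);
    Dataset<Row> t2 = ss.read().jdbc("jdbc:mysql://host:3306/db", "table2", properties);

    // An explicit join condition avoids the cartesian-product error entirely.
    Dataset<Row> joined = t1.join(t2, t1.col("id").equalTo(t2.col("id")), "inner");
    joined.show();

    ss.stop();
  }
}
```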

Issue with map function in Spark 2.2.0

2018-04-11 Thread @Nandan@
Hi, I am not able to use the .map function in Spark. My code is as below: *1) Create Parse function:-* from datetime import datetime from collections import namedtuple fields = ('date','airline','flightnum','origin','dest','dep','dep_delay','arv','arv_delay','airtime','distance') Flight =

Re: cache OS memory and spark usage of it

2018-04-11 Thread Jose Raul Perez Rodriguez
It was helpful. So the OS needs to feel some pressure from applications requesting memory before it frees some of the memory cache? Exactly under which circumstances does the OS free that memory to give it to applications requesting it? I mean, if the total memory is 16GB and 10GB are used for OS cache,

How do I implement forEachWriter in structured streaming so that the connection is created once per partition?

2018-04-11 Thread SRK
Hi, How do I implement ForeachWriter in structured streaming so that the connection is created once per partition and updates are done in a batch, just like foreachPartition in RDDs? Thanks for the help! -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
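
The usual pattern, sketched in Java with placeholder JDBC details: ForeachWriter.open() runs once per partition per trigger, so the connection is created there; rows are added to a batch in process(); the batch is flushed in close().

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;

public class JdbcForeachWriter extends ForeachWriter<Row> {
  private transient Connection connection;
  private transient PreparedStatement statement;

  @Override
  public boolean open(long partitionId, long version) {
    try {
      // Called once per partition per trigger: create the connection here, not per row.
      connection = DriverManager.getConnection(
          "jdbc:mysql://host:3306/db", "user", "password");     // placeholders
      connection.setAutoCommit(false);
      statement = connection.prepareStatement(
          "INSERT INTO sink_table (k, v) VALUES (?, ?)");       // placeholder SQL
      return true;                                              // false would skip this partition
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void process(Row row) {
    try {
      statement.setString(1, row.getString(0));
      statement.setString(2, row.getString(1));
      statement.addBatch();                                     // batched, flushed in close()
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void close(Throwable errorOrNull) {
    try {
      if (errorOrNull == null && statement != null) {
        statement.executeBatch();
        connection.commit();
      }
      if (connection != null) {
        connection.close();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```

It would typically be wired in with something like streamingDf.writeStream().foreach(new JdbcForeachWriter()).start().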

Does structured streaming support Spark Kafka Direct?

2018-04-11 Thread SRK
Hi, We have code based on Spark Kafka Direct in production and we want to port this code to Structured Streaming. Does Structured Streaming support Spark Kafka Direct? What are the configs for parallelism and scalability in Structured Streaming? In Spark Kafka Direct, the number of kafka
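
Structured Streaming has its own Kafka source rather than reusing the direct stream; a sketch with placeholder brokers, topic, and paths follows. The source reads one Spark partition per Kafka topic partition by default, and spark.sql.shuffle.partitions controls the parallelism of downstream aggregations.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredKafka {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("structured-kafka")
        // Controls shuffle parallelism for streaming aggregations; placeholder value.
        .config("spark.sql.shuffle.partitions", "200")
        .getOrCreate();

    // The "kafka" source is the Structured Streaming counterpart of the direct stream;
    // it tracks offsets in the checkpoint and reads Kafka partitions in parallel.
    Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
        .option("subscribe", "events")                       // placeholder
        .option("startingOffsets", "latest")
        .load();

    StreamingQuery query = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream()
        .format("console")
        .option("checkpointLocation", "/tmp/chk-structured-kafka")  // placeholder
        .outputMode("append")
        .start();

    query.awaitTermination();
  }
}
```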