Re: Weight column values not used in Binary Logistic Regression Summary

2017-12-09 Thread Sea aj
Hello everyone, I have a data frame which has two columns: ids and features. Each cell in the features column is an array of Vectors.dense type, like: [(DenseVector([0.5692]),), (DenseVector([0.5086]),)]. I need to train a new model for every single row of my data frame. How can I do it? On
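
A minimal PySpark sketch, assuming Spark 2.x, of the data layout described above (ids and the second row's values are illustrative; the tuple wrappers visible in the printed output are dropped so each cell is a plain array of dense vectors):

    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # two rows, each carrying its own array of DenseVector features
    df = spark.createDataFrame(
        [(0, [Vectors.dense([0.5692]), Vectors.dense([0.5086])]),
         (1, [Vectors.dense([0.1234])])],
        ["id", "features"])
    df.show(truncate=False)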

Re: Training an ML Model on a Huge Dataframe

2017-08-23 Thread Sea aj
> > On 23 August 2017 at 14:27, Sea aj <saj3...@gmail.com> wrote: > >> Hi, >> >> I am trying to feed a huge dataframe to a ml algorithm in Spark but it >> crashes due to the shortage of memory. >> >> Is there a way to train the model on a subset

Training an ML Model on a Huge Dataframe

2017-08-23 Thread Sea aj
Hi, I am trying to feed a huge dataframe to an ML algorithm in Spark, but it crashes due to a shortage of memory. Is there a way to train the model on a subset of the data in multiple steps? Thanks
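
Spark ML estimators generally consume one DataFrame in a single fit() call, so a common workaround is fitting on a sample. A hedged sketch, assuming PySpark 2.x, with an illustrative LogisticRegression and a tiny stand-in dataframe:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    # stand-in for the huge dataframe from the thread
    df = spark.createDataFrame(
        [(Vectors.dense([0.0]), 0.0), (Vectors.dense([1.0]), 1.0)] * 100,
        ["features", "label"])

    # train on a random ~10% of the rows; seed fixed only for reproducibility
    subset = df.sample(withReplacement=False, fraction=0.1, seed=42)
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(subset)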

Re: UI for spark machine learning.

2017-08-22 Thread Sea aj
st linear > regression? Probably your model would do equally well with far fewer > samples. Have you checked bias and variance if you use far fewer random > samples? > > On 22. Aug 2017, at 12:58, Sea aj <saj3...@gmail.com> wrote: > > I have a large dataframe of 1 billio

Re: UI for spark machine learning.

2017-08-22 Thread Sea aj
I have a large dataframe of 1 billion rows of type LabeledPoint. I tried to train a linear regression model on the df, but it failed due to lack of memory although I'm using 9 slaves, each with 100gb of RAM and 16 CPU cores. I decided to split my data into multiple chunks and train the model in
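
A hedged sketch of the chunking idea, assuming PySpark 2.x; note that most Spark ML estimators cannot warm-start from a previous model, so each chunk below yields an independent model:

    from pyspark.sql import SparkSession
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    # stand-in for the billion-row dataframe
    df = spark.createDataFrame(
        [(Vectors.dense([float(i)]), float(2 * i)) for i in range(1000)],
        ["features", "label"])

    # split into four random chunks and fit one model per chunk
    chunks = df.randomSplit([0.25, 0.25, 0.25, 0.25], seed=7)
    models = [LinearRegression().fit(c) for c in chunks]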

Re: SPARK Issue in Standalone cluster

2017-08-22 Thread Sea aj
Hi everyone, I have a huge dataframe with 1 billion rows, and each row is a nested list. I want to train some ML models on this df, but due to the huge size I get an out-of-memory error on one of my nodes when I run the fit function. Currently, my configuration is: 144 cores, 16 cores for

Reading csv.gz files

2017-07-05 Thread Sea aj
I need to import a set of files with the csv.gz extension into Spark. Each file contains a table of data. I was wondering if anyone knows how to read them?
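
Spark inflates gzip input transparently based on the file extension, so .csv.gz reads like plain .csv; the caveat is that gzip is not splittable, so each file becomes a single partition. A sketch, assuming Spark 2.x (path and header option illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # glob over the directory; each .csv.gz is decompressed on the fly
    df = spark.read.option("header", "true").csv("/data/tables/*.csv.gz")
    df.printSchema()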

How does Spark deal with Data Skewness?

2017-06-22 Thread Sea aj
Hi everyone, I have read about some interesting ideas on how to manage skew, but I was not sure whether any of these techniques are used in the Spark 2.x versions. To name a few, "Salting the Data" and "Dynamic Repartitioning" are techniques introduced in Spark Summits. I am really curious to
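
For context, skew handling in the 2.x line was mostly manual (automatic skew-join handling arrived with adaptive query execution in Spark 3.0). A hedged sketch of hand-rolled salting for a skewed aggregation, assuming PySpark (column names and salt width illustrative):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("hot", 1)] * 6 + [("cold", 1)], ["key", "v"])

    # Phase 1: append a random salt so one hot key spreads over 8 partial groups.
    salted = df.withColumn("salt", (F.rand() * 8).cast("int"))
    partial = salted.groupBy("key", "salt").agg(F.sum("v").alias("v"))

    # Phase 2: drop the salt and combine the partial aggregates.
    result = partial.groupBy("key").agg(F.sum("v").alias("v"))
    result.show()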

Re: SparkSQL not able to read an empty table location

2017-05-21 Thread Sea
please try spark.sql.hive.verifyPartitionPath true -- Original -- From: "Steve Loughran"; Date: Sat, May 20, 2017 09:19 PM To: "Bajpai, Amit X. -ND"; Cc: "user@spark.apache.org";
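
For reference, the suggested property as it would appear in spark-defaults.conf (it makes Spark SQL check partition paths under the table root and skip ones that do not exist in the filesystem):

    spark.sql.hive.verifyPartitionPath true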

Re: How to specify file

2016-09-23 Thread Sea
Date: 2016-09-23 3:32; To: "Sea"<261810...@qq.com>; Cc: "user"<user@spark.apache.org>; Subject: Re: How to specify file Check out the README on the following page. This is the csv connector that you are using. I think you need to specify the delimiter

How to specify file

2016-09-23 Thread Sea
Hi, I want to run sql directly on files. I find that Spark supports sql like select * from csv.`/path/to/file`, but files may not be delimited by ','. Maybe they are delimited by '\001'; how can I specify the delimiter? Thank you!
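
A hedged sketch of setting the '\001' delimiter, assuming Spark 2.x; the select * from csv.`/path/to/file` syntax itself takes no options, so the usual route is the DataFrame reader or a temporary view:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # "sep" (alias: "delimiter") accepts the \u0001 control character
    df = spark.read.option("sep", "\u0001").csv("/path/to/file")
    df.createOrReplaceTempView("t")  # then: spark.sql("select * from t")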

Re: Spark hangs at "Removed broadcast_*"

2016-07-12 Thread Sea
please provide your jstack info. -- Original -- From: "dhruve ashar"; Date: 2016-07-13 3:53; To: "Anton Sviridov"; Cc: "user"; Subject: Re: Spark hangs at "Removed

Re: Bug about reading parquet files

2016-07-08 Thread Sea
Relation.(LogicalRelation.scala:37) -- Original -- From: "lian.cs.zju"<lian.cs@gmail.com>; Date: 2016-07-08 4:47; To: "Sea"<261810...@qq.com>; Cc: "user"<user@spark.apache.org>; Subject: Re: Bug about reading parquet files What's

Bug about reading parquet files

2016-07-08 Thread Sea
I have a problem reading parquet files. sql: select count(1) from omega.dwd_native where year='2016' and month='07' and day='05' and hour='12' and appid='6'; The hive partition is (year,month,day,appid). There are only two tasks, and it will list all directories in my table, not only

Re: G1 GC takes too much time

2016-05-29 Thread Sea
Yes, it seems that CMS is better. I have tried G1 as the Databricks blog recommended, but it's too slow. -- Original -- From: "condor join"; Date: 2016-05-30 10:17; To: "Ted Yu"; Cc:
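
For reference, the two GC choices under discussion map to executor JVM flags along these lines in spark-defaults.conf (only one of the two should be active; flags illustrative of the JDK 7/8 era):

    # G1, as the Databricks post recommends
    spark.executor.extraJavaOptions -XX:+UseG1GC
    # CMS, which the reply above found faster for this workload
    # spark.executor.extraJavaOptions -XX:+UseConcMarkSweepGC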

Re: spark sql on hive

2016-04-18 Thread Sea
It's a bug of hive. Please use the hive metastore service instead of visiting mysql directly: set hive.metastore.uris in hive-site.xml. -- Original -- From: "Jieliang Li"; Date: 2016-04-19 12:55; To:
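
A hedged sketch of the suggested hive-site.xml entry (host is illustrative; 9083 is the usual metastore thrift port):

    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastore-host:9083</value>
    </property>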

Re: Limit pyspark.daemon threads

2016-03-18 Thread Sea
It's useless... The python worker will go above 1.5g in my production environment. -- Original -- From: "Ted Yu"; Date: 2016-03-17 10:50; To: "Carlile, Ken"; Cc:

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Sea
Hi, manas: Maybe you can look at this bug: https://issues.apache.org/jira/browse/SPARK-13566 -- Original -- From: "manas kar"; Date: 2016-03-15 10:48; To: "Ted Yu"; Cc:

Re: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Sea
Hi, Sumona: It's a bug in old Spark versions; in Spark 1.6.0 it is fixed. After an application completes, the Spark master loads its event log into memory, and this is synchronous because of the actor. If the event log is big, the Spark master will hang a long time, and you can not submit any applications,

Deadlock between UnifiedMemoryManager and BlockManager

2016-02-29 Thread Sea
Hi, all: My Spark version is 1.6.0. I found a deadlock in the production environment; can anyone help? I created an issue in JIRA: https://issues.apache.org/jira/browse/SPARK-13566 === "block-manager-slave-async-thread-pool-1": at

Re: off-heap certain operations

2016-02-11 Thread Sea
spark.memory.offHeap.enabled (default is false); it is wrong in the Spark docs. Spark 1.6 does not recommend using off-heap memory. -- Original -- From: "Ovidiu-Cristian MARCU"; Date: 2016-02-12 5:51; To:
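
For reference, the Spark 1.6 off-heap settings being discussed, in spark-defaults.conf form; enabling the flag only has an effect if a size is also set, since it defaults to 0 (value illustrative):

    spark.memory.offHeap.enabled true
    spark.memory.offHeap.size 2g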

Re: Shuffle memory woes

2016-02-07 Thread Sea
Hi, Corey: "The dataset is 100gb at most, the spills can be up to 10T-100T". Are your input files in lzo format, and do you use sc.textFile()? If memory is not enough, Spark will spill 3-4x of the input data to disk. -- Original -- From: "Corey

How to query data in tachyon with spark-sql

2016-01-20 Thread Sea
Hi, all: I want to mount some Hive tables in Tachyon, but I don't know how to query the data in Tachyon with spark-sql. Who knows?
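
One hedged route from the Spark 1.x era: register a Hive external table whose LOCATION is a Tachyon URI, then query it like any other table (schema, host, port, and path are illustrative; assumes the Tachyon client jar is on the classpath):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="tachyon-sql")
    hc = HiveContext(sc)
    # external table backed by files that live in Tachyon
    hc.sql("CREATE EXTERNAL TABLE IF NOT EXISTS events (id INT, name STRING) "
           "STORED AS PARQUET LOCATION 'tachyon://tachyon-master:19998/events'")
    hc.sql("SELECT count(*) FROM events").show()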

Re: How to use Java8

2016-01-05 Thread Sea
thanks -- Original -- From: "Andy Davidson"<a...@santacruzintegration.com>; Date: 2016-01-06 11:04; To: "Sea"<261810...@qq.com>; "user"<user@spark.apache.org>; Subject: Re: How to use Java8

How to use Java8

2016-01-05 Thread Sea
Hi, all: I want to support Java 8. I use JDK 1.8.0_65 in the production environment, but it doesn't work. Should I build Spark using JDK 1.8 and set 1.8 in pom.xml? java.lang.UnsupportedClassVersionError: Unsupported major.minor version 52.
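
Class-file version 52 is the Java 8 format, so the error means Java-8-compiled classes are reaching an older JVM somewhere. A hedged sketch of rebuilding Spark itself under JDK 8 (the java.version property drives -source/-target in the Spark 1.x poms; paths illustrative):

    # build and run under the same JDK 8
    export JAVA_HOME=/usr/java/jdk1.8.0_65
    ./build/mvn -DskipTests -Djava.version=1.8 clean package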

[Spark Streaming] Unable to write checkpoint when restarting

2015-11-21 Thread Sea
When I restart my streaming program, this bug appears, and it will kill my program. I am using Spark 1.4.1. 15/11/22 03:20:00 WARN CheckpointWriter: Error in attempt 1 of writing checkpoint to hdfs://streaming/user/dm/order_predict/streaming_v2/10/checkpoint/checkpoint-144813360

Re: About memory leak in spark 1.4.1

2015-08-05 Thread Sea
spark.worker.cleanup.enabled true spark.logConf true spark.rdd.compress true On 4 August 2015 at 12:59, Sea 261810...@qq.com wrote: How many machines are there in your standalone cluster? I am not using tachyon. GC can not help me... Can anyone help? my configuration

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Sea
that indeed spark got your 50g per executor limit? I mean in the configuration page.. maybe you are using offheap storage (Tachyon)? On 3 August 2015 at 04:58, Sea 261810...@qq.com wrote: spark uses a lot more than heap memory, it is the expected behavior. It didn't exist in spark 1.3.x. What does

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
that you MUST store in memory for performance, better give spark more space.. ended up setting it to 0.3. All that said, it is on spark 1.3 on a cluster; hope that helps. On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote: Hi, all I upgraded spark to 1.4.1, many applications failed... I

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
mention spark.storage.memoryFraction in two places. One is under Cache Size Tuning section. FYI On Sun, Aug 2, 2015 at 2:16 AM, Sea 261810...@qq.com wrote: Hi, Barak It is ok with spark 1.3.0, the problem is with spark 1.4.1. I don't think spark.storage.memoryFraction will make any sense

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
Sea 261810...@qq.com wrote: spark.storage.memoryFraction is in heap memory, but my situation is that the memory is more than heap memory! Anyone else use spark 1.4.1 in production? -- Original -- From: Ted Yu <yuzhih...@gmail.com>; Date: 2015-08-02

About memory leak in spark 1.4.1

2015-08-01 Thread Sea
Hi, all I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and it increases as time goes on, finally exceeding the max limit of the server, and the worker dies.

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Sea
This exception is so ugly!!! The screen is full of this information when the program runs a long time, and it will not fail the job. I commented it out in the source code. I think this information is useless because the executor is already removed, and I don't know what the executor id

About extra memory in yarn mode

2015-07-14 Thread Sea
Hi all: I have a question about why spark on yarn needs extra memory. I applied for 10 executors with executor memory 6g, and I find that it allocates 1g more per executor, 7g in total per executor. I tried to set spark.yarn.executor.memoryOverhead, but it did not help. 1g for 1 executor is too
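
For reference: in Spark 1.x the overhead defaults to max(384 MB, 10% of executor memory), and YARN then rounds each container up to a multiple of yarn.scheduler.minimum-allocation-mb, which is typically why 6g turns into 7g. The knob mentioned above, in spark-defaults.conf (value in MB, illustrative):

    spark.yarn.executor.memoryOverhead 512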

[SPARK-SQL] libgplcompression.so already loaded in another classloader

2015-07-07 Thread Sea
Hi, all I found an exception when using spark-sql: java.lang.UnsatisfiedLinkError: Native Library /data/lib/native/libgplcompression.so already loaded in another classloader ... I set spark.sql.hive.metastore.jars=. in the file spark-defaults.conf. It does not happen every time. Who knows

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Sea
SPARK_CLASSPATH is nice; spark.jars needs to list all the jars one by one when submitting to yarn, because spark.driver.classpath and spark.executor.classpath are not available in yarn mode. Can someone remove the warning from the code, or upload the jar in spark.driver.classpath and

Re: Time is ugly in Spark Streaming....

2015-06-27 Thread Sea
Das t...@databricks.com wrote: Could you print the time on the driver (that is, in foreachRDD but before RDD.foreachPartition) and see if it is behaving weird? TD On Fri, Jun 26, 2015 at 3:57 PM, Emrehan Tüzün emrehan.tu...@gmail.com wrote: On Fri, Jun 26, 2015 at 12:30 PM, Sea 261810

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Sea
26, 2015 at 11:06 AM, Sea 261810...@qq.com wrote: Hi, all I find a problem in spark streaming: when I use the time in the function foreachRDD... I find the time is very interesting. val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet
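
A hedged PySpark sketch of TD's suggestion: print the batch time on the driver, inside foreachRDD but before the per-partition work, and compare it with whatever the partitions print (source and batch interval illustrative; a socket stream stands in for the Kafka stream):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="time-check")
    ssc = StreamingContext(sc, 2)                    # 2-second batches
    lines = ssc.socketTextStream("localhost", 9999)  # stand-in for the Kafka stream

    def check(time, rdd):
        print("driver-side batch time: %s" % time)   # runs on the driver
        rdd.foreachPartition(lambda part: sum(1 for _ in part))

    lines.foreachRDD(check)
    ssc.start()
    ssc.awaitTermination()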

How to use a different version of hive

2015-06-21 Thread Sea
Hi, all: We have our own version of hive 0.13.1; we altered the code for permissions on operating tables, and there is an issue of hive 0.13.1, HIVE-6131. Spark 1.4.0 supports different versions of the hive metastore; who can give an example? I am confused by these spark.sql.hive.metastore.jars
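
A hedged example of the Spark 1.4.0 settings in question, in spark-defaults.conf (the jars value must point at the Hive 0.13.1 client jars plus matching Hadoop jars; paths illustrative):

    spark.sql.hive.metastore.version 0.13.1
    spark.sql.hive.metastore.jars /opt/hive-0.13.1/lib/*:/opt/hadoop/share/hadoop/common/*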

Re: About Jobs UI in yarn-client mode

2015-06-21 Thread Sea
To yarn-site.xml. The problem is solved. Spark 1.4 + Yarn 2.7 + Java 8. On Fri, Jun 19, 2015 at 8:48 AM, Sea 261810...@qq.com wrote: Hi, all: I run spark on yarn and I want to see the Jobs UI at http://ip:4040/, but it redirects to http://${yarn.ip}/proxy/application_1428110196022_924324/ which can

About Jobs UI in yarn-client mode

2015-06-19 Thread Sea
Hi, all: I run spark on yarn, and I want to see the Jobs UI at http://ip:4040/, but it redirects to http://${yarn.ip}/proxy/application_1428110196022_924324/, which cannot be found. Why? Can anyone help?

Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Sea
Hi, all: I want to run spark sql on yarn (yarn-client), but ... I already set spark.yarn.jar and spark.jars in conf/spark-defaults.conf. ./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 game.txt Exception in thread main java.lang.NoClassDefFoundError:
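
org.apache.spark.deploy.yarn.ExecutorLauncher ships in the Spark assembly, so the usual fix in this era was making sure spark.yarn.jar points at a reachable copy of that assembly. A hedged sketch for spark-defaults.conf (HDFS path and version illustrative):

    spark.yarn.jar hdfs:///user/spark/share/lib/spark-assembly-1.4.0-hadoop2.2.0.jar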

Exception in using updateStateByKey

2015-04-27 Thread Sea
Hi, all: I use the function updateStateByKey in Spark Streaming. I need to store the states for one minute, so I set spark.cleaner.ttl to 120; the batch duration is 2 seconds, but it throws an exception: Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist:
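
For reference, the setting involved: spark.cleaner.ttl (in seconds) periodically deletes metadata and persisted data older than the TTL, including files the checkpoint may still reference, so it has to be comfortably larger than any state lifetime. In spark-defaults.conf form (value illustrative):

    spark.cleaner.ttl 3600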

Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
the value for spark.cleaner.ttl larger? Cheers On Mon, Apr 27, 2015 at 7:13 AM, Sea 261810...@qq.com wrote: my hadoop version is 2.2.0, the hdfs-audit.log is too large. The problem is that when the checkpoint info is deleted (it depends on spark.cleaner.ttl), it will throw

Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
Subject: Re: Exception in using updateStateByKey Can you make the value for spark.cleaner.ttl larger? Cheers On Mon, Apr 27, 2015 at 7:13 AM, Sea 261810...@qq.com wrote: my hadoop version is 2.2.0, the hdfs-audit.log is too large. The problem is that when the checkpoint info is deleted

Filesystem closed Exception

2015-03-20 Thread Sea
Hi, all: When I exit the console of spark-sql, the following exception is thrown. My spark version is 1.3.0, hadoop version is 2.2.0. Exception in thread Thread-3 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629) at

InvalidAuxServiceException in dynamicAllocation

2015-03-17 Thread Sea
Hi, all: Spark 1.3.0, hadoop 2.2.0. I put the following params in spark-defaults.conf: spark.dynamicAllocation.enabled true spark.dynamicAllocation.minExecutors 20 spark.dynamicAllocation.maxExecutors 300 spark.dynamicAllocation.executorIdleTimeout 300 spark.shuffle.service.enabled true
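
InvalidAuxServiceException under dynamic allocation usually means the NodeManagers are not running the spark_shuffle aux service that spark.shuffle.service.enabled expects. A hedged sketch of the yarn-site.xml registration (the spark-*-yarn-shuffle jar must also be on the NodeManager classpath):

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>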