Spark Streaming job breaks with shuffle file not found

2018-03-28 Thread Jone Zhang
The Spark Streaming job runs for a few days, then fails as below.
What is the possible reason?

*18/03/25 07:58:37 ERROR yarn.ApplicationMaster: User class threw
exception: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 16 in stage 80018.0 failed 4 times, most recent failure: Lost
task 16.3 in stage 80018.0 (TID 7318859, 10.196.155.153):
java.io.FileNotFoundException:
/data/hadoop_tmp/nm-local-dir/usercache/mqq/appcache/application_1521712903594_6152/blockmgr-7aa2fb13-25d8-4145-a704-7861adfae4ec/22/shuffle_40009_16_0.data.574b45e8-bafd-437d-8fbf-deb6e3a1d001
(No such file or directory)*

Thanks!


Wish you give our product a wonderful name

2017-09-08 Thread Jone Zhang
We have built an ML platform based on open source frameworks like Hadoop,
Spark, and TensorFlow. Now we need to give our product a wonderful name,
and we are eager for everyone's advice.

Any answers will be greatly appreciated.
Thanks.


How can I split a dataset into multiple datasets?

2017-08-06 Thread Jone Zhang
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(
  Seq(
    StructField("app", StringType, nullable = true),
    StructField("server", StringType, nullable = true),
    StructField("file", StringType, nullable = true),
    StructField("...", StringType, nullable = true)
  )
)
val row = ...
val dataset = session.createDataFrame(row, schema)

How can I split the dataset into an array of datasets keyed by the composite key (app, server, file), as follows:
Map[(app, server, file), Dataset]
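
A minimal sketch of one way to do this, assuming the number of distinct (app, server, file) combinations is small enough to collect on the driver: gather the keys, then build one filtered view per key.

import org.apache.spark.sql.DataFrame

// Collect the distinct composite keys on the driver.
val keys: Array[(String, String, String)] = dataset
  .select("app", "server", "file")
  .distinct()
  .collect()
  .map(r => (r.getString(0), r.getString(1), r.getString(2)))

// Build one filtered view of the original dataset per composite key.
val splits: Map[(String, String, String), DataFrame] =
  keys.map { case key @ (app, server, file) =>
    key -> dataset.filter(
      dataset("app") === app && dataset("server") === server && dataset("file") === file)
  }.toMap

Each entry in the map is a lazily evaluated view of the original dataset, so no data is duplicated until an action runs on it.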


Thanks.


Can I move TFS and TFSFT out of the Spark package?

2017-07-26 Thread Jone Zhang
I have built spark-assembly-1.6.0-hadoop2.5.1.jar.

cat spark-assembly-1.6.0-hadoop2.5.1.jar/META-INF/services/org.apache.hadoop.fs.FileSystem
...
org.apache.hadoop.hdfs.DistributedFileSystem
org.apache.hadoop.hdfs.web.HftpFileSystem
org.apache.hadoop.hdfs.web.HsftpFileSystem
org.apache.hadoop.hdfs.web.WebHdfsFileSystem
org.apache.hadoop.hdfs.web.SWebHdfsFileSystem
tachyon.hadoop.TFS
tachyon.hadoop.TFSFT

Can I move TFS and TFSFT out of spark-assembly-1.6.0-hadoop2.5.1.jar?
How do I modify it before building?


Thanks.


Re: Why is spark.sql.autoBroadcastJoinThreshold not available?

2017-05-15 Thread Jone Zhang
Solved it by removing the lazy keyword:
2. HiveContext.sql("cache table feature as select * from src where ...")
whose result size is only 100K.
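
For reference, a minimal sketch of the working setup, assuming the existing HiveContext is named hiveContext and using a hypothetical join column "key": the broadcast() function hints the planner to broadcast the small table, and explain(true) confirms which join strategy was planned.

import org.apache.spark.sql.functions.broadcast

// "feature" is the small, eagerly cached table; "key" is a hypothetical join column.
val sample  = hiveContext.table("sample")
val feature = hiveContext.table("feature")
sample.join(broadcast(feature), sample("key") === feature("key")).explain(true)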

Thanks!

2017-05-15 21:26 GMT+08:00 Yong Zhang <java8...@hotmail.com>:

> You should post the execution plan here, so we can provide more accurate
> support.
>
>
> Since in your feature table you are building it with a projection ("where
> ..."), my guess is that the following JIRA (SPARK-13383
> <https://issues.apache.org/jira/browse/SPARK-13383>) stops the broadcast
> join. This is fixed in Spark 2.x. Can you try it on Spark 2.0?
>
> Yong
>
> --
> *From:* Jone Zhang <joyoungzh...@gmail.com>
> *Sent:* Wednesday, May 10, 2017 7:10 AM
> *To:* user@spark.apache.org
> *Subject:* Why is spark.sql.autoBroadcastJoinThreshold not available?
>
> Now I use Spark 1.6.0 in Java.
> I wish the following SQL to be executed as a broadcast join:
> *select * from sample join feature*
>
> These are my steps:
> 1. set spark.sql.autoBroadcastJoinThreshold=100M
> 2. HiveContext.sql("cache lazy table feature as select * from src where ...")
> whose result size is only 100K
> 3. HiveContext.sql("select * from sample join feature")
> Why is the join a SortMergeJoin?
>
> Grateful for any idea!
> Thanks.
>


How can I merge multiple rows into one row in Spark SQL or Hive SQL?

2017-05-15 Thread Jone Zhang
For example
Data1(has 1 billion records)
user_id1  feature1
user_id1  feature2

Data2(has 1 billion records)
user_id1  feature3

Data3(has 1 billion records)
user_id1  feature4
user_id1  feature5
...
user_id1  feature100

I want to get the result as follows:
user_id1  feature1 feature2 feature3 feature4 feature5...feature100

Is there a more efficient way other than a join?
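
One possible alternative, sketched under the assumption that the inputs are registered as temp tables data1, data2, data3 with a (user_id, feature) schema and that a HiveContext named hiveContext is available: union them once and aggregate per user_id, instead of chaining wide joins.

// Union the inputs and group once per user_id, collecting all features into a
// single space-separated string.
val all = hiveContext.sql(
  """SELECT user_id, feature FROM data1
    |UNION ALL
    |SELECT user_id, feature FROM data2
    |UNION ALL
    |SELECT user_id, feature FROM data3""".stripMargin)
all.registerTempTable("all_features")

val merged = hiveContext.sql(
  "SELECT user_id, concat_ws(' ', collect_list(feature)) AS features " +
  "FROM all_features GROUP BY user_id")

Note that collect_list does not guarantee the order of the features within the merged string.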

Thanks!


Why is spark.sql.autoBroadcastJoinThreshold not available?

2017-05-10 Thread Jone Zhang
Now I use Spark 1.6.0 in Java.
I wish the following SQL to be executed as a broadcast join:
*select * from sample join feature*

These are my steps:
1. set spark.sql.autoBroadcastJoinThreshold=100M
2. HiveContext.sql("cache lazy table feature as select * from src where ...")
whose result size is only 100K
3. HiveContext.sql("select * from sample join feature")
Why is the join a SortMergeJoin?
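
A minimal sketch for checking what the planner actually chose, assuming a HiveContext named hiveContext and the table names from the thread:

// In Spark 1.6 this threshold is an integer number of bytes, so a value like
// "100M" may not be parsed as intended; 104857600 bytes is 100 MB.
hiveContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")

// Print the physical plan and look for BroadcastHashJoin vs SortMergeJoin.
hiveContext.sql("select * from sample join feature").explain(true)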

Grateful for any idea!
Thanks.


org.apache.hadoop.fs.FileSystem: Provider tachyon.hadoop.TFS could not be instantiated

2017-05-05 Thread Jone Zhang
*When I use Spark SQL, the error is as follows:*

17/05/05 15:58:44 WARN scheduler.TaskSetManager: Lost task 0.0 in
stage 20.0 (TID 4080, 10.196.143.233):
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem:
Provider tachyon.hadoop.TFS could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:224)
at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2558)
at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2569)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2586)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:365)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at 
org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:654)
at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
at 
org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:321)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class
tachyon.hadoop.TFS
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:379)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
... 45 more

17/05/05 15:58:44 INFO cluster.YarnClusterScheduler: Removed TaskSet
20.0, whose tasks have all completed, from pool


I did not use Tachyon directly.


*Grateful for any idea!*


Why do Chinese characters appear garbled when I use Spark textFile?

2017-04-05 Thread Jone Zhang
var textFile = sc.textFile("xxx");
textFile.first();
res1: String = 1.0 100733314   18_?:100733314
8919173c6d49abfab02853458247e5841:129:18_?:1.0


hadoop fs -cat xxx
1.0100733314   18_百度输入法:100733314 8919173c6d49abfab02853458247e584
 1:129:18_百度输入法:1.0

Why do Chinese characters appear garbled when I use Spark textFile?
The encoding of the HDFS file is UTF-8.
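
A minimal sketch for ruling out a decoding mismatch, assuming the bytes on HDFS might not actually be UTF-8 (for example GBK): sc.textFile always decodes as UTF-8 via Text.toString, so read the raw Text records and decode them with an explicit charset instead.

import java.nio.charset.Charset
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Decode explicitly; swap "GBK" for whatever encoding the file was written with.
val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat]("xxx")
  .map { case (_, text) =>
    new String(text.getBytes, 0, text.getLength, Charset.forName("GBK"))
  }
lines.first()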


Thanks


Is there a length limit for Spark SQL / Hive SQL?

2016-10-26 Thread Jone Zhang
Is there a length limit for Spark SQL / Hive SQL?
Can ANTLR work well if the SQL is very long?

Thanks.


Can I display a message on the console when using Spark on YARN?

2016-10-20 Thread Jone Zhang
I submit Spark with "spark-submit --master yarn-cluster --deploy-mode
cluster".
How can I display a message on the YARN client console?
I expect it to be like this:

.
16/10/20 17:12:53 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:12:58 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:13:03 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:13:08 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:13:13 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: FINISHED)
16/10/20 17:13:13 main INFO org.apache.spark.deploy.yarn.Client>SPK>
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: 10.51.215.100
 ApplicationMaster RPC port: 0
 queue: root.default
 start time: 1476954698645
 final status: SUCCEEDED
 tracking URL:
http://10.179.20.47:8080/proxy/application_1453970859007_481440/history/application_1453970859007_481440/1
 user: mqq
===Spark Task Result is ===
===some message want to display===
16/10/20 17:13:13 Thread-3 INFO
org.apache.spark.util.ShutdownHookManager>SPK> Shutdown hook called
16/10/20 17:13:13 Thread-3 INFO
org.apache.spark.util.ShutdownHookManager>SPK> Deleting directory
/data/home/spark_tmp/spark-5b9f434b-5837-46e6-9625-c4656b86af9e
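
In yarn-cluster mode the driver runs inside the cluster, so its stdout never reaches the submitting console. One workaround, sketched below with a hypothetical HDFS path, is to write the message to HDFS from the driver and have the wrapper script around spark-submit print it after the application finishes.

import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

// Write the message from the driver to a well-known HDFS location; the submit
// script can `hadoop fs -cat` it once the application reports FINISHED.
val fs = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path("/tmp/spark_task_result.txt"), true)  // hypothetical path
out.write("===some message want to display===\n".getBytes(StandardCharsets.UTF_8))
out.close()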

Thanks.


Re: High virtual memory consumption on spark-submit client.

2016-05-13 Thread jone
No, I have set master to yarn-cluster.
While SparkPi is running, the result of free -t is as follows:
[running]mqq@10.205.3.29:/data/home/hive/conf$ free -t
 total   used   free shared    buffers cached
Mem:  32740732   32105684 635048  0 683332   28863456
-/+ buffers/cache:    2558896   30181836
Swap:  2088952  60320    2028632
Total:    34829684   32166004    2663680
After SparkPi succeeds, the result is as follows:
[running]mqq@10.205.3.29:/data/home/hive/conf$ free -t
 total   used   free shared    buffers cached
Mem:  32740732   31614452    1126280  0 683624   28863096
-/+ buffers/cache:    2067732   30673000
Swap:  2088952  60320    2028632
Total:    34829684   31674772    3154912
Mich Talebzadeh <mich.talebza...@gmail.com> wrote on 13 May 2016 at 14:47:
Is this a standalone setup on a single host where the executor runs inside the driver?
Also run free -t to see the virtual memory usage, which is basically swap space:
free -t
             total       used       free     shared    buffers     cached
Mem:      24546308   24268760     277548          0    1088236   15168668
-/+ buffers/cache:    8011856   16534452
Swap:      2031608        304    2031304
Total:    26577916   24269064    2308852

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 13 May 2016 at 07:36, Jone Zhang <joyoungzh...@gmail.com> wrote:
Mich, do you want this?
==
[running]mqq@10.205.3.29:/data/home/hive/conf$ ps aux | grep SparkPi
mqq      20070  3.6  0.8 10445048 267028 pts/16 Sl+ 13:09   0:11
/data/home/jdk/bin/java
-Dlog4j.configuration=file:///data/home/spark/conf/log4j.properties
-cp /data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/*:/data/home/spark/conf/:/data/home/spark/lib/spark-assembly-1.4.1-hadoop2.5.1_150903.jar:/data/home/spark/lib/datanucleus-api-jdo-3.2.6.jar:/data/home/spark/lib/datanucleus-core-3.2.10.jar:/data/home/spark/lib/datanucleus-rdbms-3.2.9.jar:/data/home/hadoop/conf/:/data/home/hadoop/conf/:/data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/*
-XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master
yarn-cluster --class org.apache.spark.examples.SparkPi --queue spark
--num-executors 4
/data/home/spark/lib/spark-examples-1.4.1-hadoop2.5.1.jar 1
mqq      22410  0.0  0.0 110600  1004 pts/8    S+   13:14   0:00 grep SparkPi
[running]mqq@10.205.3.29:/data/home/hive/conf$ top -p 20070

top - 13:14:48 up 504 days, 19:17, 19 users,  load average: 1.41, 1.10, 0.99
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.1%us,  2.7%sy,  0.0%ni, 74.4%id,  4.5%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  32740732k total, 31606288k used,  113k free,   475908k buffers
Swap:  2088952k total,    61076k used,  2027876k free, 27594452k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20070 mqq       20   0 10.0g 260m  32m S  0.0  0.8   0:11.38 java
==

Harsh, there is 1 physical CPU core and 4 virtual CPU cores.

Thanks.

2016-05-13 13:08 GMT+08:00, Harsh J <ha...@cloudera.com>:
> How many CPU cores are on that machine? Read http://qr.ae/8Uv3Xq
>
> You can also confirm the above by running the pmap utility on your process
> and most of the virtual memory would be under 'anon'.
>
> On Fri, 13 May 2016 09:11 jone, <joyoungzh...@gmail.com> wrote:
>
>> The virtual memory is 9G when I run org.apache.spark.examples.SparkPi
>> under yarn-cluster mode, which uses the default configurations.
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
>> COMMAND
>>
>> 4519 mqq       20   0 9041m 248m  26m S  0.3  0.8   0:19.85 java
>> I am curious why it is so high?
>>
>> Thanks.
>>
>





High virtual memory consumption on spark-submit client.

2016-05-12 Thread jone
The virtual memory is 9G when I run org.apache.spark.examples.SparkPi under yarn-cluster mode, which uses the default configurations.
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 
 4519 mqq   20   0 9041m 248m  26m S  0.3  0.8   0:19.85 java  
 I am curious why it is so high?
Thanks.