Spark 3.3.0 with Structured Streaming from Kafka: issue with commons-pool2

2022-08-26 Thread Raymond Tang
Hi all,
I encountered an issue when reading from Kafka as a stream and then sinking into 
HDFS (using the Delta Lake format).


java.lang.NoSuchMethodError: 
org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool$PoolConfig.setMinEvictableIdleTime(Ljava/time/Duration;)V
I looked into the details and found it occurs because the Spark built-in jars 
include version 1.5.4 (commons-pool-1.5.4.jar) while the package 
org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 relies on version 2.11.1, and 
these two versions are not compatible. As a workaround for now I placed the 
latest version on my Spark classpath, as the following page 
documents:
java.lang.NoSuchMethodError: 
PoolConfig.setMinEvictableIdleTime<https://kontext.tech/article/1178/javalangnosuchmethoderror-poolconfigsetminevictableidletime>

Questions from me:

  *   Can the bundled version be bumped in future Spark releases to avoid issues 
like this?
  *   Based on your knowledge, will this workaround cause any side effects?

I’m a frequent user of Spark but I don’t have much detailed knowledge of its 
underlying code (I only look into it when I need to debug a complex problem).

Thanks and Regards,
Raymond
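
For reference, a minimal sketch of the classpath workaround described above. The
application class and jar below are placeholders, and the Maven coordinate is an
assumption for commons-pool2 2.11.1; depending on classloading order,
spark.driver.userClassPathFirst / spark.executor.userClassPathFirst may also be
needed:

spark-submit \
  --packages org.apache.commons:commons-pool2:2.11.1 \
  --class com.example.MyStreamingApp my-streaming-app.jar

Alternatively, the commons-pool2 2.11.1 jar can be placed directly under
$SPARK_HOME/jars, which is the approach the page linked above documents.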





Spark SQL Predict Pushdown for Hive Bucketed Table

2022-08-26 Thread Raymond Tang

Hi all,
Does anyone know why Spark SQL does not use Hive bucket pruning when reading 
from a bucketed Hive table?
[SPARK-40206] Spark SQL Predict Pushdown for Hive Bucketed Table - ASF JIRA 
(apache.org)<https://issues.apache.org/jira/browse/SPARK-40206>

Details are also provided at the end of this mail.

Regards,
Raymond


Hi team,

I was testing out Hive bucketed table features. One of the benefits, as most 
documentation suggests, is that a bucketed Hive table can be used for query 
filter/predicate pushdown to improve query performance.

However, through my exploration, that doesn't seem to be true. Can you please 
help clarify whether Spark SQL supports this query optimization when using a 
Hive bucketed table?



How to reproduce the issue:

Create a Hive 3 table using the following DDL:

create table test_db.bucket_table(user_id int, key string)
comment 'A bucketed table'
partitioned by(country string)
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;

And then insert into this table using the following PySpark script:

from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

# prepare sample data for inserting into hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
    data.append([int(i), 'U' + str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()

# Save df to Hive table test_db.bucket_table
df.write.mode('append').insertInto('test_db.bucket_table')

Then query the table using the following script:

from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
    where country='AU' and user_id=101
    """)
df.show()
df.explain(extended=True)

I expected Spark to read only one bucket file from HDFS, but instead it scanned 
all bucket files in the partition folder country=AU.

== Parsed Logical Plan ==
'Project [*]
+- 'Filter (('country = AU) AND ('t1.user_id = 101))
   +- 'SubqueryAlias t1
      +- 'UnresolvedRelation [test_db, bucket_table], [], false

== Analyzed Logical Plan ==
user_id: int, key: string, country: string
Project [user_id#20, key#21, country#22]
+- Filter ((country#22 = AU) AND (user_id#20 = 101))
   +- SubqueryAlias t1
      +- SubqueryAlias spark_catalog.test_db.bucket_table
         +- Relation test_db.bucket_table[user_id#20,key#21,country#22] orc

== Optimized Logical Plan ==
Filter (((isnotnull(country#22) AND isnotnull(user_id#20)) AND (country#22 = AU)) AND (user_id#20 = 101))
+- Relation test_db.bucket_table[user_id#20,key#21,country#22] orc

== Physical Plan ==
*(1) Filter (isnotnull(user_id#20) AND (user_id#20 = 101))
+- *(1) ColumnarToRow
   +- FileScan orc test_db.bucket_table[user_id#20,key#21,country#22] Batched: true, DataFilters: [isnotnull(user_id#20), (user_id#20 = 101)], Format: ORC, Location: InMemoryFileIndex(1 paths)[hdfs://localhost:9000/user/hive/warehouse/test_db.db/bucket_table/coun..., PartitionFilters: [isnotnull(country#22), (country#22 = AU)], PushedFilters: [IsNotNull(user_id), EqualTo(user_id,101)], ReadSchema: struct

Am I doing something wrong, or is it that Spark doesn't support this? Your 
guidance and help will be appreciated.
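
As a hedged aside (an assumption based on general Spark behavior, not something
established in this thread): the bucket pruning shown as "SelectedBucketsCount"
in the FileScan appears to apply to tables bucketed with Spark's own
DataFrameWriter.bucketBy, while Hive-bucketed tables are read as plain files. A
Scala sketch with a hypothetical table name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark bucket pruning sketch")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Write the same sample data, but with Spark's own bucketing metadata.
val sample = (0 until 1000)
  .map(i => (i, "U" + i, if (i % 2 == 0) "CN" else "AU"))
  .toDF("user_id", "key", "country")

sample.write
  .mode("overwrite")
  .partitionBy("country")
  .bucketBy(10, "user_id")
  .sortBy("key")
  .format("orc")
  .saveAsTable("test_db.spark_bucket_table")   // hypothetical table name

// With spark.sql.sources.bucketing.enabled left at its default (true), the
// physical plan's FileScan should report something like
// "SelectedBucketsCount: 1 out of 10" for this query.
spark.sql(
  "select * from test_db.spark_bucket_table where country = 'AU' and user_id = 101"
).explain()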






Re: Spark Event Log Forwarding and Offset Tracking

2021-02-04 Thread Raymond Tan
Thanks Jacek. Can you point me to some sample implementations of this that
I can use as a reference?

On Sun, Jan 17, 2021 at 10:09 PM Jacek Laskowski  wrote:

> Hi,
>
> > Forwarding Spark Event Logs to identify critical events like job start,
> executor failures, job failures etc to ElasticSearch via log4j. However I
> could not find any way to forward event log via log4j configurations. Is
> there any other recommended approach to track these application events?
>
> I'd use SparkListener API (
> http://spark.apache.org/docs/latest/api/scala/org/apache/spark/scheduler/SparkListener.html
> )
>
> > 2 - For Spark streaming jobs, is there any way to identify that data
> from Kafka is not consumed for whatever reason, or the offsets are not
> progressing as expected and also forward that to ElasticSearch via log4j
> for monitoring
>
> Think SparkListener API would help here too.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
> <https://twitter.com/jaceklaskowski>
>
>
> On Wed, Jan 13, 2021 at 5:15 PM raymond.tan 
> wrote:
>
>> Hello here, I am new to spark and am trying to add some monitoring for
>> spark applications specifically to handle the below situations - 1 -
>> Forwarding Spark Event Logs to identify critical events like job start,
>> executor failures, job failures etc to ElasticSearch via log4j. However I
>> could not find any way to forward event log via log4j configurations. Is
>> there any other recommended approach to track these application events? 2 -
>> For Spark streaming jobs, is there any way to identify that data from Kafka
>> is not consumed for whatever reason, or the offsets are not progressing as
>> expected and also forward that to ElasticSearch via log4j for monitoring
>> Thanks, Raymond
>> --
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
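
A minimal Scala sketch of the SparkListener approach Jacek suggests, assuming the
events are forwarded through a log4j logger whose appender ships to ElasticSearch
(the logger name and the appender configuration are assumptions, not something
from this thread):

import org.apache.log4j.Logger
import org.apache.spark.scheduler._

// Forwards a few critical events to a log4j logger; an ElasticSearch-bound
// appender for the "spark.events" logger is assumed to exist in log4j.properties.
class EventForwardingListener extends SparkListener {
  private val log = Logger.getLogger("spark.events")

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    log.info(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stage(s)")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    log.info(s"Job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")

  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
    log.warn(s"Executor ${removed.executorId} removed: ${removed.reason}")
}

// Register it programmatically with spark.sparkContext.addSparkListener(new
// EventForwardingListener), or via --conf spark.extraListeners=<fully.qualified.Name>.
// For the Kafka-offset question, a StreamingQueryListener (Structured Streaming)
// exposes per-batch source offsets in onQueryProgress and can be forwarded the same way.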


Re: Unable to run simple spark-sql

2019-06-21 Thread Raymond Honderdors
Good to hear.
It was what I thought.
Hard to validate without the actual configuration
(did not have time to set up Ambari).


On Fri, Jun 21, 2019, 15:44 Nirmal Kumar  wrote:

> Hey Raymond,
>
> The root cause of the problem was that the Hive database location was
> 'file:/home/hive/spark-warehouse/testdb.db/employee_orc’
>
> I checked that using desc extended testdb.employee
>
> It might be some config issue in the cluster at that time that made the
> location point to the local filesystem.
>
> I created a new database and confirmed that the location was in HDFS,
> i.e. hdfs://xxx:8020/apps/hive/warehouse/
> For this, the code ran fine.
>
> Thanks for the help,
> -Nirmal
>
> From: Nirmal Kumar
> Sent: 19 June 2019 11:51
> To: Raymond Honderdors 
> Cc: user 
> Subject: RE: Unable to run simple spark-sql
>
> Hi Raymond,
>
> I cross checked hive/conf/hive-site.xml and spark2/conf/hive-site.xml
> Same value is being shown by Ambari Hive config.
> Seems correct value here:
>
>   
>   hive.metastore.warehouse.dir
>   /apps/hive/warehouse
>  
>
> Problem :
> Spark trying to create a local directory under the home directory of hive
> user (/home/hive/).
> Why is it referring the local file system and from where?
>
> Thanks,
> Nirmal
>
> From: Raymond Honderdors <raymond.honderd...@sizmek.com>
> Sent: 19 June 2019 11:18
> To: Nirmal Kumar <nirmal.ku...@impetus.co.in>
> Cc: user <user@spark.apache.org>
> Subject: Re: Unable to run simple spark-sql
>
> Hi Nirmal,
> i came across the following article "
> https://stackoverflow.com/questions/47497003/why-is-hive-creating-tables-in-the-local-file-system
> "
> (and an updated ref link :
> https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration
> )
> you should check "hive.metastore.warehouse.dir" in hive config files
>
>
> On Tue, Jun 18, 2019 at 8:09 PM Nirmal Kumar <nirmal.ku...@impetus.co.in> wrote:
> Just an update on the thread that it's kerberized.
>
> I'm trying to execute the query with a different user, xyz, not hive.
> It seems like a permission issue: the user xyz is trying to create a
> directory under /home/hive.
>
> Do I need some impersonation setting?
>
> Thanks,
> Nirmal
>
> Get Outlook for Android<https://aka.ms/ghei36>
>
> 
> From: Nirmal Kumar
> Sent: Tuesday, June 18, 2019 5:56:06 PM
> To: Raymond Honderdors; Nirmal Kumar
> Cc: user
> Subject: RE: Unable to run simple spark-sql
>
> Hi Raymond,
>
> Permission on hdfs is 777
> drwxrwxrwx   - impadmin hdfs  0 2019-06-13 16:09
> /home/hive/spark-warehouse
>
>
> But it’s pointing to a local file system:
> Exception in thread "main" java.lang.IllegalStateException: Cannot create
> staging directory
> 'file:/home/hive/spark-warehouse/testdb.db/employee_orc/.hive-staging_hive_2019-06-18_16-08-21_448_1691186175028734135-1'
>
> Thanks,
> -Nirmal
>
>
> From: Raymond Honderdors <raymond.honderd...@sizmek.com>
> Sent: 18 June 2019 17:52
> To: Nirmal Kumar <nirmal.ku...@impetus.co.in.invalid>
> Cc: user <user@spark.apache.org>
> Subject: Re: Unable to run simple spark-sql
>
> Hi
> Can you check the permission of th

Re: Unable to run simple spark-sql

2019-06-18 Thread Raymond Honderdors
Hi Nirmal,
i came across the following article "
https://stackoverflow.com/questions/47497003/why-is-hive-creating-tables-in-the-local-file-system
"
(and an updated ref link :
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration
)
you should check "hive.metastore.warehouse.dir" in hive config files


On Tue, Jun 18, 2019 at 8:09 PM Nirmal Kumar 
wrote:

> Just an update on the thread that it's kerberized.
>
> I'm trying to execute the query with a different user, xyz, not hive.
> It seems like a permission issue: the user xyz is trying to create a
> directory under /home/hive.
>
> Do I need some impersonation setting?
>
> Thanks,
> Nirmal
>
> Get Outlook for Android<https://aka.ms/ghei36>
>
> 
> From: Nirmal Kumar
> Sent: Tuesday, June 18, 2019 5:56:06 PM
> To: Raymond Honderdors; Nirmal Kumar
> Cc: user
> Subject: RE: Unable to run simple spark-sql
>
> Hi Raymond,
>
> Permission on hdfs is 777
> drwxrwxrwx   - impadmin hdfs  0 2019-06-13 16:09
> /home/hive/spark-warehouse
>
>
> But it’s pointing to a local file system:
> Exception in thread "main" java.lang.IllegalStateException: Cannot create
> staging directory
> 'file:/home/hive/spark-warehouse/testdb.db/employee_orc/.hive-staging_hive_2019-06-18_16-08-21_448_1691186175028734135-1'
>
> Thanks,
> -Nirmal
>
>
> From: Raymond Honderdors 
> Sent: 18 June 2019 17:52
> To: Nirmal Kumar 
> Cc: user 
> Subject: Re: Unable to run simple spark-sql
>
> Hi
> Can you check the permissions of the user running Spark
> on the HDFS folder where it tries to create the table?
>
> On Tue, Jun 18, 2019, 15:05 Nirmal Kumar <nirmal.ku...@impetus.co.in.invalid> wrote:
> Hi List,
>
> I tried running the following sample Java code using Spark2 version 2.0.0
> on YARN (HDP-2.5.0.0)
>
> public class SparkSQLTest {
>   public static void main(String[] args) {
> SparkSession sparkSession = SparkSession.builder().master("yarn")
> .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
> .config("hive.metastore.uris", "thrift://x:9083")
> .config("spark.driver.extraJavaOptions",
> "-Dhdp.version=2.5.0.0-1245")
> .config("spark.yarn.am<
> http://secure-web.cisco.com/1beuiC-aaBQJ0jgI7vONgZiTP5gCokYFEbllyW3ShZVdpQaIuYfuuEuS8iwzhqvwBE8C_E_bBe_7isO-HyPEVX6ZgJajKrQ6oWvTeBQCMjTHVCVImERG2S9qSHrH_mDzf656vrBFxAT1MYZhTZYzXl_3hyZ4BH-XCbKjXrCDyR1OR3tYqqDc7if9NJ1gqHWPwg84tho0__fut2d8y4XxMoMTQNnJzx5367QL6lYV5CFZj055coSLihVVYrh5jBID5jJF40PsrWSvdW7gJ_P6IAN9jTpHFJD7ZrokjlyS7WBAx5Mtnd2KxvNc2O6kKcxk2/http%3A%2F%2Fspark.yarn.am>.extraJavaOptions",
> "-Dhdp.version=2.5.0.0-1245")
> .config("spark.yarn.jars",
> "hdfs:///tmp/lib/spark2/*").enableHiveSupport().getOrCreate();
>
> sparkSession.sql("insert into testdb.employee_orc select * from
> testdb.employee where empid<5");
>   }
> }
>
> I get the following error pointing to a local file system
> (file:/home/hive/spark-warehouse) wondering from where its being picked:
>
> 16:08:21.321 [dispatcher-event-loop-7] INFO
> org.apache.spark.storage.BlockManagerInfo - Added broadcast_0_piece0 in
> memory on 192.168.218.92:40831 (size: 30.6 KB, free: 4.0 GB)
> 16:08:21.322 [main] DEBUG org.apache.spark.storage.BlockManagerMaster -
> Updated info of block broadcast_0_piece0
> 16:08:21.323 [main] DEBUG org.apache.spark.storage.BlockManager - Told
> master about block broadcast_0_piece0
> 16:08:21.323 [main] DEBUG org.apache.spark.storage.BlockManager - Put
> block broadcast_0_piece0 locally took  4 ms
> 16:08:21.323 [main] DEBUG org.apache.spark.storage.BlockManager - Putting
> block broadcast_0_piece0 without replication took  4 ms
> 16:08:21.326 [main] INFO org.apache.spark.SparkContext - Created broadcast
> 0 from sql at SparkSQLTest.java:33
> 16:08:21.449 [main] DEBUG
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable - Created staging
> dir =
> file:/home/hive/spark-warehouse/testdb.db/employee_orc/.hive-staging_hive_2019-06-18_16-08-21_448_1691186175028734135-1
> for path = file:/home/hive/spark-warehouse/testdb.db/emplo

Re: Unable to run simple spark-sql

2019-06-18 Thread Raymond Honderdors
Hi
Can you check the permissions of the user running Spark
on the HDFS folder where it tries to create the table?

On Tue, Jun 18, 2019, 15:05 Nirmal Kumar 
wrote:

> Hi List,
>
> I tried running the following sample Java code using Spark2 version 2.0.0
> on YARN (HDP-2.5.0.0)
>
> public class SparkSQLTest {
>   public static void main(String[] args) {
> SparkSession sparkSession = SparkSession.builder().master("yarn")
> .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
> .config("hive.metastore.uris", "thrift://x:9083")
> .config("spark.driver.extraJavaOptions",
> "-Dhdp.version=2.5.0.0-1245")
> .config("spark.yarn.am.extraJavaOptions",
> "-Dhdp.version=2.5.0.0-1245")
> .config("spark.yarn.jars",
> "hdfs:///tmp/lib/spark2/*").enableHiveSupport().getOrCreate();
>
> sparkSession.sql("insert into testdb.employee_orc select * from
> testdb.employee where empid<5");
>   }
> }
>
> I get the following error pointing to a local file system
> (file:/home/hive/spark-warehouse) wondering from where its being picked:
>
> 16:08:21.321 [dispatcher-event-loop-7] INFO
> org.apache.spark.storage.BlockManagerInfo - Added broadcast_0_piece0 in
> memory on 192.168.218.92:40831 (size: 30.6 KB, free: 4.0 GB)
> 16:08:21.322 [main] DEBUG org.apache.spark.storage.BlockManagerMaster -
> Updated info of block broadcast_0_piece0
> 16:08:21.323 [main] DEBUG org.apache.spark.storage.BlockManager - Told
> master about block broadcast_0_piece0
> 16:08:21.323 [main] DEBUG org.apache.spark.storage.BlockManager - Put
> block broadcast_0_piece0 locally took  4 ms
> 16:08:21.323 [main] DEBUG org.apache.spark.storage.BlockManager - Putting
> block broadcast_0_piece0 without replication took  4 ms
> 16:08:21.326 [main] INFO org.apache.spark.SparkContext - Created broadcast
> 0 from sql at SparkSQLTest.java:33
> 16:08:21.449 [main] DEBUG
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable - Created staging
> dir =
> file:/home/hive/spark-warehouse/testdb.db/employee_orc/.hive-staging_hive_2019-06-18_16-08-21_448_1691186175028734135-1
> for path = file:/home/hive/spark-warehouse/testdb.db/employee_orc
> 16:08:21.451 [main] INFO org.apache.hadoop.hive.common.FileUtils -
> Creating directory if it doesn't exist:
> file:/home/hive/spark-warehouse/testdb.db/employee_orc/.hive-staging_hive_2019-06-18_16-08-21_448_1691186175028734135-1
> Exception in thread "main" java.lang.IllegalStateException: Cannot create
> staging directory
> 'file:/home/hive/spark-warehouse/testdb.db/employee_orc/.hive-staging_hive_2019-06-18_16-08-21_448_1691186175028734135-1'
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.getStagingDir(InsertIntoHiveTable.scala:83)
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.getExternalScratchDir(InsertIntoHiveTable.scala:97)
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.getExternalTmpPath(InsertIntoHiveTable.scala:105)
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:148)
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
> at
> com..xxx.xxx.xxx..SparkSQLTest.main(SparkSQLTest.java:33)
> 16:08:21.454 [pool-8-thread-1] INFO org.apache.spark.SparkContext -
> Invoking stop() from shutdown hook
> 16:08:21.455 [pool-8-thread-1] DEBUG
> org.spark_project.jetty.util.component.AbstractLifeCycle - stopping
> org.spark_project.jetty.server.Server@620aa4ea
> 16:08:21.455 [pool-8-thread-1] DEBUG org.spark_project.jetty.server.Server
> - Graceful shutdown org.spark_project.jetty.server.Server@620aa4ea by
>
> Thanks,
> -Nirmal
>
> 
>
>
>
>
>
>

Anomaly when dealing with Unix timestamp

2018-06-19 Thread Raymond Xie
Hello,

I have a dataframe; applying from_unixtime seems to expose an anomaly:

*scala> val bhDF4 = bhDF.withColumn("ts1", $"ts" + 28800).withColumn("ts2",
from_unixtime($"ts" + 28800,"MMddhhmmss"))*
*bhDF4: org.apache.spark.sql.DataFrame = [user_id: int, item_id: int ... 5
more fields]*

*scala> bhDF4.show*
*+---+---+---++--+--+--+*
*|user_id|item_id| cat_id|behavior|ts|   ts1|   ts2|*
*+---+---+---++--+--+--+*
*|  1|2268318|2520377|  pv|1511544070|1511572870|20171124082110|*
*|  1|246|2520771|  pv|1511561733|1511590533|20171125011533|*
*|  1|2576651| 149192|  pv|1511572885|1511601685|20171125042125|*
*|  1|3830808|4181361|  pv|1511593493|1511622293|20171125100453|*
*|  1|4365585|2520377|  pv|1511596146|1511624946|20171125104906|*
*|  1|4606018|2735466|  pv|1511616481|1511645281|20171125042801|*
*|  1| 230380| 411153|  pv|1511644942|1511673742|2017112612|*
*|  1|3827899|2920476|  pv|1511713473|1511742273|20171126072433|*
*|  1|3745169|2891509|  pv|1511725471|1511754271|20171126104431|*
*|  1|1531036|2920476|  pv|1511733732|1511762532|20171127010212|*
*|  1|2266567|4145813|  pv|1511741471|1511770271|2017112703|*
*|  1|2951368|1080785|  pv|1511750828|1511779628|20171127054708|*
*|  1|3108797|2355072|  pv|1511758881|1511787681|20171127080121|*
*|  1|1338525| 149192|  pv|1511773214|1511802014|20171127120014|*
*|  1|2286574|2465336|  pv|1511797167|1511825967|20171127063927|*
*|  1|5002615|2520377|  pv|1511839385|1511868185|20171128062305|*
*|  1|2734026|4145813|  pv|1511842184|1511870984|20171128070944|*
*|  1|5002615|2520377|  pv|1511844273|1511873073|20171128074433|*
*|  1|3239041|2355072|  pv|1511855664|1511884464|20171128105424|*
*|  1|4615417|4145813|  pv|1511870864|1511899664|20171128030744|*
*+---+---+---++--+--+--+*
*only showing top 20 rows*


*All ts2 values are supposed to show dates after 20171125, while there seems to
be at least one anomaly showing 20171124.*

*Any thoughts?*



*Sincerely yours,*


*Raymond*
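
A hedged Scala sketch of one likely explanation (an assumption, not a confirmed
diagnosis): from_unixtime already renders the epoch seconds in the session time
zone, so adding 28800 shifts the value a second time, and a lowercase "hh"
pattern is the 12-hour clock, which folds evening hours back into the morning.
Setting the session time zone (the zone name below is an assumption) and using
"HH" avoids both:

import org.apache.spark.sql.functions.from_unixtime
import spark.implicits._

// Render the raw epoch seconds directly in the intended zone instead of adding
// a manual 8-hour offset, and use the 24-hour "HH" rather than the 12-hour "hh".
spark.conf.set("spark.sql.session.timeZone", "Asia/Shanghai")
val bhDF5 = bhDF.withColumn("ts2", from_unixtime($"ts", "yyyyMMddHHmmss"))
bhDF5.show(false)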


Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
Thank you, that works.


**
*Sincerely yours,*


*Raymond*

On Tue, Jun 19, 2018 at 4:36 PM, Nicolas Paris  wrote:

> Hi Raymond
>
> Spark works well on a single machine too, since it benefits from multiple
> cores.
> The CSV parser is based on univocity and you might use the
> "spark.read.csv" syntax instead of using the RDD API.
>
> From my experience, this will be better than any other CSV parser
>
> 2018-06-19 16:43 GMT+02:00 Raymond Xie :
>
>> Thank you Matteo, Askash and Georg:
>>
>> I am attempting to get some stats first, the data is like:
>>
>> 1,4152983,2355072,pv,1511871096
>>
>> I like to find out the count of Key of (UserID, Behavior Type)
>>
>> val bh_count = 
>> sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv").map(_.split(",")).map(x
>>  => ((x(0).toInt,x(3)),1)).groupByKey()
>>
>> This shows me:
>> scala> val first = bh_count.first
>> [Stage 1:>  (0 +
>> 1) / 1]2018-06-19 10:41:19 WARN  Executor:66 - Managed memory leak
>> detected; size = 15848112 bytes, TID = 110
>> first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1,
>> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
>> 1, 1))
>>
>>
>> *Note this environment is: Windows 7 with 32GB RAM. (I am firstly running
>> it in Windows where I have more RAM instead of Ubuntu so the env differs to
>> what I said in the original email)*
>> *Dataset is 3.6GB*
>>
>> *Thank you very much.*
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>> On Tue, Jun 19, 2018 at 4:04 AM, Matteo Cossu  wrote:
>>
>>> Single machine? Any other framework will perform better than Spark
>>>
>>> On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
>>> wrote:
>>>
>>>> Georg, just asking, can Pandas handle such a big dataset? If that data
>>>> is further passed into using any of the sklearn modules?
>>>>
>>>> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler <
>>>> georg.kf.hei...@gmail.com> wrote:
>>>>
>>>>> use pandas or dask
>>>>>
>>>>> If you do want to use spark store the dataset as parquet / orc. And
>>>>> then continue to perform analytical queries on that dataset.
>>>>>
>>>>> Raymond Xie  schrieb am Di., 19. Juni 2018 um
>>>>> 04:29 Uhr:
>>>>>
>>>>>> I have a 3.6GB csv dataset (4 columns, 100,150,807 rows), my
>>>>>> environment is 20GB ssd harddisk and 2GB RAM.
>>>>>>
>>>>>> The dataset comes with
>>>>>> User ID: 987,994
>>>>>> Item ID: 4,162,024
>>>>>> Category ID: 9,439
>>>>>> Behavior type ('pv', 'buy', 'cart', 'fav')
>>>>>> Unix Timestamp: span between November 25 to December 03, 2017
>>>>>>
>>>>>> I would like to hear any suggestion from you on how should I process
>>>>>> the dataset with my current environment.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> **
>>>>>> *Sincerely yours,*
>>>>>>
>>>>>>
>>>>>> *Raymond*
>>>>>>
>>>>>
>>>>
>>
>
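
A hedged Scala sketch of the DataFrame-based counting suggested above, assuming
the file has the five columns shown in the sample row (the Windows path is taken
from the thread):

// Count (user_id, behavior) pairs with the CSV reader plus an aggregation,
// instead of groupByKey, which shuffles the full list of 1s for every key.
val userBehavior = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .toDF("user_id", "item_id", "cat_id", "behavior", "ts")

val bhCount = userBehavior.groupBy("user_id", "behavior").count()
bhCount.show()

On the RDD side, replacing groupByKey() with reduceByKey(_ + _) would similarly
avoid materializing the CompactBuffer of 1s seen in the output above.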


Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
Thank you Matteo, Aakash and Georg:

I am attempting to get some stats first, the data is like:

1,4152983,2355072,pv,1511871096

I'd like to find out the count per key of (UserID, Behavior Type):

val bh_count = 
sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv").map(_.split(",")).map(x
=> ((x(0).toInt,x(3)),1)).groupByKey()

This shows me:
scala> val first = bh_count.first
[Stage 1:>  (0 + 1)
/ 1]2018-06-19 10:41:19 WARN  Executor:66 - Managed memory leak detected;
size = 15848112 bytes, TID = 110
first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1))


*Note this environment is Windows 7 with 32GB RAM. (I am first running it on
Windows, where I have more RAM, instead of Ubuntu, so the environment differs
from what I said in the original email.)*
*Dataset is 3.6GB*

*Thank you very much.*
*----*
*Sincerely yours,*


*Raymond*

On Tue, Jun 19, 2018 at 4:04 AM, Matteo Cossu  wrote:

> Single machine? Any other framework will perform better than Spark
>
> On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
> wrote:
>
>> Georg, just asking, can Pandas handle such a big dataset? If that data is
>> further passed into using any of the sklearn modules?
>>
>> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler > > wrote:
>>
>>> use pandas or dask
>>>
>>> If you do want to use spark store the dataset as parquet / orc. And then
>>> continue to perform analytical queries on that dataset.
>>>
>>> Raymond Xie  schrieb am Di., 19. Juni 2018 um
>>> 04:29 Uhr:
>>>
>>>> I have a 3.6GB csv dataset (4 columns, 100,150,807 rows), my
>>>> environment is 20GB ssd harddisk and 2GB RAM.
>>>>
>>>> The dataset comes with
>>>> User ID: 987,994
>>>> Item ID: 4,162,024
>>>> Category ID: 9,439
>>>> Behavior type ('pv', 'buy', 'cart', 'fav')
>>>> Unix Timestamp: span between November 25 to December 03, 2017
>>>>
>>>> I would like to hear any suggestion from you on how should I process
>>>> the dataset with my current environment.
>>>>
>>>> Thank you.
>>>>
>>>> **
>>>> *Sincerely yours,*
>>>>
>>>>
>>>> *Raymond*
>>>>
>>>
>>


Best way to process this dataset

2018-06-18 Thread Raymond Xie
I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my environment is a
20GB SSD hard disk and 2GB of RAM.

The dataset comes with
User ID: 987,994
Item ID: 4,162,024
Category ID: 9,439
Behavior type ('pv', 'buy', 'cart', 'fav')
Unix Timestamp: span between November 25 to December 03, 2017

I would like to hear any suggestions from you on how I should process the
dataset with my current environment.

Thank you.

**
*Sincerely yours,*


*Raymond*


Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Raymond Xie
Thank you Subhash.

Here is the new command:
spark-submit --master local[*] --class retail_db.GetRevenuePerOrder --conf
spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client
/public/retail_db/order_items /home/rxie/output/revenueperorder

Still seeing the same issue here.
2018-06-17 11:51:25 INFO  RMProxy:98 - Connecting to ResourceManager at /
0.0.0.0:8032
2018-06-17 11:51:27 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:28 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:29 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:30 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:31 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:32 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:33 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:34 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:35 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)
2018-06-17 11:51:36 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is

   RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
sleepTime=1000 MILLISECONDS)



**
*Sincerely yours,*


*Raymond*

On Sun, Jun 17, 2018 at 2:36 PM, Subhash Sriram 
wrote:

> Hi Raymond,
>
> If you set your master to local[*] instead of yarn-client, it should run
> on your local machine.
>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Jun 17, 2018, at 2:32 PM, Raymond Xie  wrote:
>
> Hello,
>
> I am wondering how can I run spark job in my environment which is a single
> Ubuntu host with no hadoop installed? if I run my job like below, I will
> end up with infinite loop at the end. Thank you very much.
>
> rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder
> --conf spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client
> /public/retail_db/order_items /home/rxie/output/revenueperorder
> 2018-06-17 11:19:36 WARN  Utils:66 - Your hostname, ubuntu resolves to a
> loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface
> ens33)
> 2018-06-17 11:19:36 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to
> bind to another address
> 2018-06-17 11:19:37 WARN  NativeCodeLoader:62 - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2018-06-17 11:19:38 INFO  SparkContext:54 - Running Spark version 2.3.1
> 2018-06-17 11:19:38 WARN  SparkConf:66 - spark.master yarn-client is
> deprecated in Spark 2.0+, please instead use "yarn" with specified deploy
> mode.
> 2018-06-17 11:19:38 INFO  SparkContext:54 - Submitted application: Get
> Revenue Per Order
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing view acls to: rxie
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing modify acls to:
> rxie
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing view acls groups
> to:
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - Changing modify acls groups
> to:
> 2018-06-17 11:19:38 INFO  SecurityManager:54 - SecurityManager:
> authentication disabled; ui acls disabled; users  with view permissions:
> Set(rxie); groups with view permissions: Set(); users  with modify
> permissions: Set(rxie); groups with modify permissions: Set()
> 2018-06-17 11:19:39 INFO  Utils:54 - Successfully sta
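
The ResourceManager retries above suggest something is still selecting YARN even
with --master local[*]. A hedged Scala sketch of one possible cause and fix (the
actual GetRevenuePerOrder source is not shown in this thread, so the structure,
argument order, and column layout below are assumptions): if the application
calls setMaster with its first argument, that overrides the --master flag, so
letting spark-submit decide the master and dropping the leftover yarn-client
argument avoids the YARN connection attempts.

import org.apache.spark.{SparkConf, SparkContext}

object GetRevenuePerOrder {
  def main(args: Array[String]): Unit = {
    // No setMaster here: take the master from spark-submit so the same jar runs
    // with --master local[*] on a laptop or --master yarn on a cluster.
    val conf = new SparkConf().setAppName("Get Revenue Per Order")
    val sc = new SparkContext(conf)

    val orderItems = sc.textFile(args(0))   // e.g. /public/retail_db/order_items
    val revenuePerOrder = orderItems
      .map(_.split(","))
      .map(cols => (cols(1).toInt, cols(4).toFloat))   // (order_id, subtotal) -- assumed layout
      .reduceByKey(_ + _)

    revenuePerOrder.saveAsTextFile(args(1))
    sc.stop()
  }
}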

how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Raymond Xie
 - Started
o.s.j.s.ServletContextHandler@4ced35ed{/executors,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@2c22a348
{/executors/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@7bd69e82
{/executors/threadDump,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@74d7184a
{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@51b01960{/static,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@6ca320ab{/,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@50d68830{/api,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@6754ef00{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@619bd14c
{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-06-17 11:19:40 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and
started at http://192.168.112.141:12678
2018-06-17 11:19:41 INFO  SparkContext:54 - Added JAR
file:/home/rxie/data/spark2practice_2.11-0.1.jar at spark://
192.168.112.141:44709/jars/spark2practice_2.11-0.1.jar with timestamp
1529259581058
2018-06-17 11:19:44 INFO  RMProxy:98 - Connecting to ResourceManager at /
0.0.0.0:8032
2018-06-17 11:19:46 INFO  Client:871 - Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
MILLISECONDS)
Start the infinite loop here

**
*Sincerely yours,*


*Raymond*


Re: Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Raymond Xie
Thank you Vamshi,

Yes, the path has presumably been added; here it is:

rxie@ubuntu:~/Downloads/spark$ echo $PATH
/home/rxie/Downloads/spark:/usr/bin/java:/usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/spark:/home/rxie/Downloads/spark/bin:/usr/bin/java
rxie@ubuntu:~/Downloads/spark$ spark-shell
Error: Could not find or load main class org.apache.spark.launcher.Main



**
*Sincerely yours,*


*Raymond*

On Sun, Jun 17, 2018 at 8:44 AM, Vamshi Talla  wrote:

> Raymond,
>
>
> Is your SPARK_HOME set? In your .bash_profile, try setting the below:
>
> export SPARK_HOME=/home/Downloads/spark (or wherever your spark is
> downloaded to)
>
> once done, source your .bash_profile or restart the shell and try
> spark-shell
>
>
> Best Regards,
>
> Vamshi T
>
>
> --
> *From:* Raymond Xie 
> *Sent:* Sunday, June 17, 2018 6:27 AM
> *To:* user; Hui Xie
> *Subject:* Error: Could not find or load main class
> org.apache.spark.launcher.Main
>
> Hello,
>
> It would be really appreciated if anyone can help sort it out the
> following path issue for me? I highly doubt this is related to missing path
> setting but don't know how can I fix it.
>
>
>
> rxie@ubuntu:~/Downloads/spark$ echo $PATH
> /usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/spark:/home/rxie/Downloads/spark/bin:/usr/bin/java
> rxie@ubuntu:~/Downloads/spark$ pyspark
> Error: Could not find or load main class org.apache.spark.launcher.Main
> rxie@ubuntu:~/Downloads/spark$ spark-shell
> Error: Could not find or load main class org.apache.spark.launcher.Main
> rxie@ubuntu:~/Downloads/spark$ pwd
> /home/rxie/Downloads/spark
> rxie@ubuntu:~/Downloads/spark$ ls
> bin  conf  data  examples  jars  kubernetes  licenses  R  yarn
>
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>
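
A hedged sketch of the environment setup Vamshi describes, using the download
path shown earlier in this thread (adjust to wherever Spark is actually
unpacked):

export SPARK_HOME=/home/rxie/Downloads/spark
export PATH=$SPARK_HOME/bin:$PATH
spark-shell

The launcher error usually means the jars under $SPARK_HOME (the jars/ directory
in Spark 2.x) cannot be located, so it is worth confirming that directory exists
and that spark-shell is being resolved from $SPARK_HOME/bin.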


Error: Could not find or load main class org.apache.spark.launcher.Main

2018-06-17 Thread Raymond Xie
Hello,

It would be really appreciated if anyone could help me sort out the following
path issue. I highly doubt this is related to a missing path setting,
but I don't know how I can fix it.



rxie@ubuntu:~/Downloads/spark$ echo $PATH
/usr/bin/java:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rxie/Downloads/spark:/home/rxie/Downloads/spark/bin:/usr/bin/java
rxie@ubuntu:~/Downloads/spark$ pyspark
Error: Could not find or load main class org.apache.spark.launcher.Main
rxie@ubuntu:~/Downloads/spark$ spark-shell
Error: Could not find or load main class org.apache.spark.launcher.Main
rxie@ubuntu:~/Downloads/spark$ pwd
/home/rxie/Downloads/spark
rxie@ubuntu:~/Downloads/spark$ ls
bin  conf  data  examples  jars  kubernetes  licenses  R  yarn


**
*Sincerely yours,*


*Raymond*


spark-shell doesn't start

2018-06-17 Thread Raymond Xie
Hello, I am doing the practice on Ubuntu now; here is the error I am
encountering:


rxie@ubuntu:~/Downloads/spark/bin$ spark-shell
Error: Could not find or load main class org.apache.spark.launcher.Main


What am I missing?

Thank you very much.

Java is installed.

**
*Sincerely yours,*


*Raymond*


spark-submit Error: Cannot load main class from JAR file

2018-06-17 Thread Raymond Xie
Hello, I am doing the practice on Windows now.
I have the jar file generated under:
C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar

The package name is Retail_db and the object is GetRevenuePerOrder.

The spark-submit command is:

spark-submit retail_db.GetRevenuePerOrder --class "C:\RXIE\Learning\Scala\
spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar" local
 C:\RXIE\Learning\Data\data-master\retail_db\order_items
C:\C:\RXIE\Learning\Data\data-master\retail_db\order_items\revenue_per_order

When executing this command, it throws out the error:
Error: Cannot load main class from JAR file

What is the right submit command to write here?


Thank you very much.

By the way, the command is really long; how can I split it into multiple
lines, like with '\' on Linux?




**
*Sincerely yours,*


*Raymond*
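
For reference, spark-submit expects the fully qualified class after --class,
followed by the application jar and then the program arguments. A hedged sketch
using the paths from this mail (the doubled "C:\C:\" in the output path looks
like a typo and is dropped here):

spark-submit --class retail_db.GetRevenuePerOrder ^
  C:\RXIE\Learning\Scala\spark2practice\target\scala-2.11\spark2practice_2.11-0.1.jar ^
  local C:\RXIE\Learning\Data\data-master\retail_db\order_items ^
  C:\RXIE\Learning\Data\data-master\retail_db\order_items\revenue_per_order

In the Windows cmd shell a long command is continued onto the next line with the
caret (^), which plays the role '\' does on Linux; in PowerShell the backtick
does the same.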


Re: Not able to sort out environment settings to start spark from windows

2018-06-16 Thread Raymond Xie
Thank you, but there is no special character or space; I actually copied it
from Program Files to the root to ensure there is no space in the path.


**
*Sincerely yours,*


*Raymond*

On Sat, Jun 16, 2018 at 3:42 PM, vaquar khan  wrote:

> Please check your JAVA_HOME path.
> Maybe there is a special character or space in your path.
>
> Regards,
> Vaquar khan
>
> On Sat, Jun 16, 2018, 1:36 PM Raymond Xie  wrote:
>
>> I am trying to run spark-shell in Windows but receive error of:
>>
>> \Java\jre1.8.0_151\bin\java was unexpected at this time.
>>
>> Environment:
>>
>> System variables:
>>
>> SPARK_HOME:
>>
>> c:\spark
>>
>> Path:
>>
>> C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\
>> ProgramData\Anaconda2;C:\ProgramData\Anaconda2\Library\
>> mingw-w64\bin;C:\ProgramData\Anaconda2\Library\usr\bin;C:\
>> ProgramData\Anaconda2\Library\bin;C:\ProgramData\Anaconda2\
>> Scripts;C:\ProgramData\Oracle\Java\javapath;C:\Windows\
>> system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\
>> WindowsPowerShell\v1.0\;I:\Anaconda2;I:\Anaconda2\
>> Scripts;I:\Anaconda2\Library\bin;C:\Program Files
>> (x86)\sbt\\bin;C:\Program Files (x86)\Microsoft SQL
>> Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL
>> Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL
>> Server\100\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL
>> Server\100\Tools\Binn\VSShell\Common7\IDE\;C:\Program Files
>> (x86)\Microsoft Visual Studio 9.0\Common7\IDE\PrivateAssemblies\;C:\Program
>> Files (x86)\Microsoft SQL Server\100\DTS\Binn\;%DDPATH%;
>> %USERPROFILE%\.dnx\bin;C:\Program Files\Microsoft DNX\Dnvm\;C:\Program
>> Files\Microsoft SQL 
>> Server\130\Tools\Binn\;C:\jre1.8.0_151\bin\server;C:\Program
>> Files (x86)\OpenSSH\bin;C:\Program Files (x86)\Calibre2\;C:\Program
>> Files\nodejs\;C:\Program Files (x86)\Skype\Phone\;
>> %JAVA_HOME%\bin;%JAVA_HOME%\jre\bin;C:\Program Files
>> (x86)\scala\bin;C:\hadoop\bin;C:\Program Files\Git\cmd;I:\Program
>> Files\EmEditor; C:\RXIE\Learning\Spark\bin;C:\spark\bin
>>
>> JAVA_HOME:
>>
>> C:\jdk1.8.0_151\bin
>>
>> JDK_HOME:
>>
>> C:\jdk1.8.0_151
>>
>> I also copied all  C:\jdk1.8.0_151 to  C:\Java\jdk1.8.0_151, and
>> received the same error.
>>
>> Any help is greatly appreciated.
>>
>> Thanks.
>>
>>
>>
>>
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>
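
One hedged observation on the variables quoted above (an assumption about the
cause, not a confirmed fix): JAVA_HOME conventionally points at the JDK root
rather than its bin folder, and the Windows launch scripts are known to stumble
over java paths containing spaces or parentheses, which would match the
half-expanded "\Java\jre1.8.0_151\bin\java" in the error. A sketch of adjusted
settings:

set JAVA_HOME=C:\jdk1.8.0_151
set SPARK_HOME=C:\spark
set PATH=%JAVA_HOME%\bin;%SPARK_HOME%\bin;%PATH%

Placing %JAVA_HOME%\bin ahead of the "Oracle\Java\javapath" and jre1.8.0_151
entries in PATH should keep the launcher from picking up a java under
"Program Files (x86)".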


Not able to sort out environment settings to start spark from windows

2018-06-16 Thread Raymond Xie
I am trying to run spark-shell in Windows but receive error of:

\Java\jre1.8.0_151\bin\java was unexpected at this time.

Environment:

System variables:

SPARK_HOME:

c:\spark

Path:

C:\Program Files (x86)\Common
Files\Oracle\Java\javapath;C:\ProgramData\Anaconda2;C:\ProgramData\Anaconda2\Library\mingw-w64\bin;C:\ProgramData\Anaconda2\Library\usr\bin;C:\ProgramData\Anaconda2\Library\bin;C:\ProgramData\Anaconda2\Scripts;C:\ProgramData\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;I:\Anaconda2;I:\Anaconda2\Scripts;I:\Anaconda2\Library\bin;C:\Program
Files (x86)\sbt\\bin;C:\Program Files (x86)\Microsoft SQL
Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL
Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL
Server\100\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL
Server\100\Tools\Binn\VSShell\Common7\IDE\;C:\Program Files (x86)\Microsoft
Visual Studio 9.0\Common7\IDE\PrivateAssemblies\;C:\Program Files
(x86)\Microsoft SQL
Server\100\DTS\Binn\;%DDPATH%;%USERPROFILE%\.dnx\bin;C:\Program
Files\Microsoft DNX\Dnvm\;C:\Program Files\Microsoft SQL
Server\130\Tools\Binn\;C:\jre1.8.0_151\bin\server;C:\Program Files
(x86)\OpenSSH\bin;C:\Program Files (x86)\Calibre2\;C:\Program
Files\nodejs\;C:\Program Files (x86)\Skype\Phone\;
%JAVA_HOME%\bin;%JAVA_HOME%\jre\bin;C:\Program Files
(x86)\scala\bin;C:\hadoop\bin;C:\Program Files\Git\cmd;I:\Program
Files\EmEditor; C:\RXIE\Learning\Spark\bin;C:\spark\bin

JAVA_HOME:

C:\jdk1.8.0_151\bin

JDK_HOME:

C:\jdk1.8.0_151

I also copied all  C:\jdk1.8.0_151 to  C:\Java\jdk1.8.0_151, and received
the same error.

Any help is greatly appreciated.

Thanks.




**
*Sincerely yours,*


*Raymond*


Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
Thank you very much Marco,

I am a beginner in this area. Is it possible for you to show me what you
think the right script should be to get it executed in the terminal?


**
*Sincerely yours,*


*Raymond*

On Sat, Feb 25, 2017 at 6:00 PM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Try to use --packages to include the jars. From the error it seems it's
> looking for a main class in the jars, but you are running a Python script...
>
> On 25 Feb 2017 10:36 pm, "Raymond Xie" <xie3208...@gmail.com> wrote:
>
> That's right Anahita, however, the class name is not indicated in the
> original github project so I don't know what class should be used here. The
> github only says:
> and then run the example
> `$ bin/spark-submit --jars \
> external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar
> \
> examples/src/main/python/streaming/kafka_wordcount.py \
> localhost:2181 test`
> """ Can anyone give any thought on how to find out? Thank you very much
> in advance.
>
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>
> On Sat, Feb 25, 2017 at 5:27 PM, Anahita Talebi <anahita.t.am...@gmail.com
> > wrote:
>
>> You're welcome.
>> You need to specify the class. I meant like that:
>>
>> spark-submit  /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.
>> 0-1245-hadoop2.7.3.2.5.0.0-1245.jar --class "give the name of the class"
>>
>>
>>
>> On Saturday, February 25, 2017, Raymond Xie <xie3208...@gmail.com> wrote:
>>
>>> Thank you, it is still not working:
>>>
>>> [image: Inline image 1]
>>>
>>> By the way, here is the original source:
>>>
>>> https://github.com/apache/spark/blob/master/examples/src/mai
>>> n/python/streaming/kafka_wordcount.py
>>>
>>>
>>> **
>>> *Sincerely yours,*
>>>
>>>
>>> *Raymond*
>>>
>>> On Sat, Feb 25, 2017 at 4:48 PM, Anahita Talebi <
>>> anahita.t.am...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think if you remove --jars, it will work. Like:
>>>>
>>>> spark-submit  /usr/hdp/2.5.0.0-1245/spark/l
>>>> ib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>>>
>>>>  I had the same problem before and solved it by removing --jars.
>>>>
>>>> Cheers,
>>>> Anahita
>>>>
>>>> On Saturday, February 25, 2017, Raymond Xie <xie3208...@gmail.com>
>>>> wrote:
>>>>
>>>>> I am doing a spark streaming on a hortonworks sandbox and am stuck
>>>>> here now, can anyone tell me what's wrong with the following code and the
>>>>> exception it causes and how do I fix it? Thank you very much in advance.
>>>>>
>>>>> spark-submit --jars /usr/hdp/2.5.0.0-1245/spark/li
>>>>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>>>>  /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>>>>> /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test
>>>>>
>>>>> Error:
>>>>> No main class set in JAR; please specify one with --class
>>>>>
>>>>>
>>>>> spark-submit --class /usr/hdp/2.5.0.0-1245/spark/li
>>>>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>>>>  /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>>>>> /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test
>>>>>
>>>>> Error:
>>>>> java.lang.ClassNotFoundException: /usr/hdp/2.5.0.0-1245/spark/li
>>>>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>>>>
>>>>> spark-submit --class  /usr/hdp/2.5.0.0-1245/kafka/l
>>>>> ibs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>>>>> /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0
>>>>> -1245-hadoop2.7.3.2.5.0.0-1245.jar  /root/hdp/kafka_wordcount.py
>>>>> 192.168.128.119:2181 test
>>>>>
>>>>> Error:
>>>>> java.lang.ClassNotFoundException: /usr/hdp/2.5.0.0-1245/kafka/li
>>>>> bs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>>>>>
>>>>> **
>>>>> *Sincerely yours,*
>>>>>
>>>>>
>>>>> *Raymond*
>>>>>
>>>>
>>>
>
>
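
Since the question is what the right script should be, here is a hedged sketch
(not verified on the sandbox): for a Python application no --class is needed,
the .py file itself must be the first non-option argument, and the Kafka
streaming classes come in through --packages or --jars rather than as a second
application jar. The package coordinate below is an assumption for Spark 1.6.2
built against Scala 2.10:

spark-submit \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2 \
  /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test

The "No main class set in JAR" and ClassNotFoundException errors above come from
spark-submit treating the extra jar (or the path handed to --class) as the
application itself, which is why supplying a jar path to --class did not help.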


Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
That's right, Anahita; however, the class name is not indicated in the
original GitHub project, so I don't know what class should be used here. The
GitHub page only says:
and then run the example
`$ bin/spark-submit --jars \
external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar
\
examples/src/main/python/streaming/kafka_wordcount.py \
localhost:2181 test`
""" Can anyone give any thought on how to find out? Thank you very much in
advance.


**
*Sincerely yours,*


*Raymond*

On Sat, Feb 25, 2017 at 5:27 PM, Anahita Talebi <anahita.t.am...@gmail.com>
wrote:

> You're welcome.
> You need to specify the class. I meant like that:
>
> spark-submit  /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.
> 0-1245-hadoop2.7.3.2.5.0.0-1245.jar --class "give the name of the class"
>
>
>
> On Saturday, February 25, 2017, Raymond Xie <xie3208...@gmail.com> wrote:
>
>> Thank you, it is still not working:
>>
>> [image: Inline image 1]
>>
>> By the way, here is the original source:
>>
>> https://github.com/apache/spark/blob/master/examples/src/mai
>> n/python/streaming/kafka_wordcount.py
>>
>>
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>> On Sat, Feb 25, 2017 at 4:48 PM, Anahita Talebi <
>> anahita.t.am...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I think if you remove --jars, it will work. Like:
>>>
>>> spark-submit  /usr/hdp/2.5.0.0-1245/spark/l
>>> ib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>>
>>>  I had the same problem before and solved it by removing --jars.
>>>
>>> Cheers,
>>> Anahita
>>>
>>> On Saturday, February 25, 2017, Raymond Xie <xie3208...@gmail.com>
>>> wrote:
>>>
>>>> I am doing a spark streaming on a hortonworks sandbox and am stuck here
>>>> now, can anyone tell me what's wrong with the following code and the
>>>> exception it causes and how do I fix it? Thank you very much in advance.
>>>>
>>>> spark-submit --jars /usr/hdp/2.5.0.0-1245/spark/li
>>>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>>>  /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>>>> /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test
>>>>
>>>> Error:
>>>> No main class set in JAR; please specify one with --class
>>>>
>>>>
>>>> spark-submit --class /usr/hdp/2.5.0.0-1245/spark/li
>>>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>>>  /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>>>> /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test
>>>>
>>>> Error:
>>>> java.lang.ClassNotFoundException: /usr/hdp/2.5.0.0-1245/spark/li
>>>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>>>
>>>> spark-submit --class  /usr/hdp/2.5.0.0-1245/kafka/l
>>>> ibs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>>>> /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0
>>>> -1245-hadoop2.7.3.2.5.0.0-1245.jar  /root/hdp/kafka_wordcount.py
>>>> 192.168.128.119:2181 test
>>>>
>>>> Error:
>>>> java.lang.ClassNotFoundException: /usr/hdp/2.5.0.0-1245/kafka/li
>>>> bs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>>>>
>>>> **
>>>> *Sincerely yours,*
>>>>
>>>>
>>>> *Raymond*
>>>>
>>>
>>


Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
Thank you, it is still not working:

[image: Inline image 1]

By the way, here is the original source:

https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py


**
*Sincerely yours,*


*Raymond*

On Sat, Feb 25, 2017 at 4:48 PM, Anahita Talebi <anahita.t.am...@gmail.com>
wrote:

> Hi,
>
> I think if you remove --jars, it will work. Like:
>
> spark-submit  /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.
> 0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>
>  I had the same problem before and solved it by removing --jars.
>
> Cheers,
> Anahita
>
> On Saturday, February 25, 2017, Raymond Xie <xie3208...@gmail.com> wrote:
>
>> I am doing a spark streaming on a hortonworks sandbox and am stuck here
>> now, can anyone tell me what's wrong with the following code and the
>> exception it causes and how do I fix it? Thank you very much in advance.
>>
>> spark-submit --jars /usr/hdp/2.5.0.0-1245/spark/li
>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>  /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>> /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test
>>
>> Error:
>> No main class set in JAR; please specify one with --class
>>
>>
>> spark-submit --class /usr/hdp/2.5.0.0-1245/spark/li
>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>  /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>> /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test
>>
>> Error:
>> java.lang.ClassNotFoundException: /usr/hdp/2.5.0.0-1245/spark/li
>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>
>> spark-submit --class  /usr/hdp/2.5.0.0-1245/kafka/l
>> ibs/kafka-streams-0.10.0.2.5.0.0-1245.jar /usr/hdp/2.5.0.0-1245/spark/li
>> b/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
>>  /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test
>>
>> Error:
>> java.lang.ClassNotFoundException: /usr/hdp/2.5.0.0-1245/kafka/li
>> bs/kafka-streams-0.10.0.2.5.0.0-1245.jar
>>
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>


No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
I am doing Spark streaming on a Hortonworks sandbox and am stuck here
now. Can anyone tell me what's wrong with the following commands and the
exceptions they cause, and how do I fix it? Thank you very much in advance.

spark-submit --jars
/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
 /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
/root/hdp/kafka_wordcount.py 192.168.128.119:2181 test

Error:
No main class set in JAR; please specify one with --class


spark-submit --class
/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
 /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
/root/hdp/kafka_wordcount.py 192.168.128.119:2181 test

Error:
java.lang.ClassNotFoundException:
/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar

spark-submit --class
 /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
 /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test

Error:
java.lang.ClassNotFoundException:
/usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar

**
*Sincerely yours,*


*Raymond*


Re: How to connect Tableau to databricks spark?

2017-01-09 Thread Raymond Xie
To rule out the firewall blocking the port, I added a rule in Windows Firewall
to enable all inbound and outbound traffic on port 1. I then tried telnet
ec2-35-160-128-113.us-west-2.compute.amazonaws.com 1 in PuTTY and it
still doesn't work.


**
*Sincerely yours,*


*Raymond*

On Mon, Jan 9, 2017 at 4:53 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Also, meant to add the link to the docs: https://docs.databricks.com/
> user-guide/faq/tableau.html
>
>
>
>
>
> *From: *Silvio Fiorito <silvio.fior...@granturing.com>
> *Date: *Monday, January 9, 2017 at 2:59 PM
> *To: *Raymond Xie <xie3208...@gmail.com>, user <user@spark.apache.org>
> *Subject: *Re: How to connect Tableau to databricks spark?
>
>
>
> Hi Raymond,
>
>
>
> Are you using a Spark 2.0 or 1.6 cluster? With Spark 2.0 it’s just a
> matter of entering the hostname of your Databricks environment, the HTTP
> path from the cluster page, and your Databricks credentials.
>
>
>
> Thanks,
>
> Silvio
>
>
>
> *From: *Raymond Xie <xie3208...@gmail.com>
> *Date: *Sunday, January 8, 2017 at 10:30 PM
> *To: *user <user@spark.apache.org>
> *Subject: *How to connect Tableau to databricks spark?
>
>
>
> I want to do some data analytics work by leveraging Databricks spark
> platform and connect my Tableau desktop to it for data visualization.
>
>
>
> Does anyone ever make it? I've trying to follow the instruction below but
> not successful?
>
>
>
> https://docs.cloud.databricks.com/docs/latest/databricks_
> guide/01%20Databricks%20Overview/14%20Third%20Party%
> 20Integrations/01%20Setup%20JDBC%20or%20ODBC.html
>
>
>
>
>
> I got an error message in Tableau's attempt to connect:
>
>
>
> Unable to connect to the server "ec2-35-160-128-113.us-west-2.
> compute.amazonaws.com". Check that the server is running and that you
> have access privileges to the requested database.
>
>
>
> "ec2-35-160-128-113.us-west-2.compute.amazonaws.com" is the hostname of a
> EC2 instance I just created on AWS, I may have some missing there though as
> I am new to AWS.
>
>
>
> I am not sure that is related to account issue, I was using my Databricks
> account in Tableau to connect it.
>
>
>
> Thank you very much. Any clue is appreciated.
>
>
> **
>
> *Sincerely yours,*
>
>
>
> *Raymond*
>


How to connect Tableau to databricks spark?

2017-01-08 Thread Raymond Xie
I want to do some data analytics work by leveraging Databricks spark
platform and connect my Tableau desktop to it for data visualization.

Has anyone ever made this work? I've been trying to follow the instructions below but
have not been successful.

https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/14%20Third%20Party%20Integrations/01%20Setup%20JDBC%20or%20ODBC.html


I got an error message in Tableau's attempt to connect:

Unable to connect to the server "
ec2-35-160-128-113.us-west-2.compute.amazonaws.com". Check that the server
is running and that you have access privileges to the requested database.

"ec2-35-160-128-113.us-west-2.compute.amazonaws.com" is the hostname of a
EC2 instance I just created on AWS, I may have some missing there though as
I am new to AWS.

I am not sure whether this is related to an account issue; I was using my Databricks
account in Tableau to connect.

Thank you very much. Any clue is appreciated.

**
*Sincerely yours,*


*Raymond*


subscription

2017-01-08 Thread Raymond Xie
**
*Sincerely yours,*


*Raymond*


Re: Error when loading json to spark

2017-01-01 Thread Raymond Xie
Thank you very much, Marco. Is your code in Scala? Do you have a Python
example? Can anyone give me a Python example for handling JSON data in Spark?
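
For what it's worth, a minimal PySpark sketch of the same idea as the Scala
snippet quoted below, assuming Spark 1.6, that sc is the shell's SparkContext,
and with placeholder field names to be replaced by the real fields of the JSON:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sqlContext = SQLContext(sc)

# Describe the fields expected in each JSON record (placeholder names).
schema = StructType([
    StructField("hour", StringType()),
    StructField("month", StringType()),
    StructField("second", StringType()),
    StructField("year", StringType()),
    StructField("timezone", StringType()),
    StructField("day", StringType()),
    StructField("minute", StringType()),
])

# Passing the schema skips inference and pins the column names and types.
df = sqlContext.read.json("/json/", schema=schema)
df.show(10)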


**
*Sincerely yours,*


*Raymond*

On Sun, Jan 1, 2017 at 12:29 PM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hi
>you will need to pass the schema, like in the snippet below (even
> though the code might have been superseded in Spark 2.0)
>
> import sqlContext.implicits._
> val jsonRdd = sc.textFile("file:///c:/tmp/1973-01-11.json")
> val schema = (new StructType).add("hour", StringType).add("month",
> StringType)
>   .add("second", StringType).add("year", StringType)
>   .add("timezone", StringType).add("day", StringType)
>   .add("minute", StringType)
> val jsonContentWithSchema = sqlContext.jsonRDD(jsonRdd, schema)
>
> But somehow I seem to remember that there was a way, in Spark 2.0, for
> Spark to infer the schema for you.
>
> hth
> marco
>
>
>
>
>
> On Sun, Jan 1, 2017 at 12:40 PM, Raymond Xie <xie3208...@gmail.com> wrote:
>
>> I found the cause:
>>
>> I need to "put" the json file onto hdfs first before it can be used, here
>> is what I did:
>>
>> hdfs dfs -put  /root/Downloads/data/json/world_bank.json
>> hdfs://localhost:9000/json
>> df = sqlContext.read.json("/json/")
>> df.show(10)
>>
>> .
>>
>> However, there is a new problem: the JSON data needs to be tweaked a bit
>> before it can really be used. Simply using df =
>> sqlContext.read.json("/json/") makes the df messy; I need the df to know
>> the fields in the json file.
>>
>> How?
>>
>> Thank you.
>>
>>
>>
>>
>> *----*
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>> On Sat, Dec 31, 2016 at 11:52 PM, Miguel Morales <therevolti...@gmail.com
>> > wrote:
>>
>>> Looks like it's trying to treat that path as a folder, try omitting
>>> the file name and just use the folder path.
>>>
>>> On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie <xie3208...@gmail.com>
>>> wrote:
>>> > Happy new year!!!
>>> >
>>> > I am trying to load a json file into spark, the json file is attached
>>> here.
>>> >
>>> > I received the following error, can anyone help me to fix it? Thank
>>> you very
>>> > much. I am using Spark 1.6.2 and python 2.7.5
>>> >
>>> >>>> from pyspark.sql import SQLContext
>>> >>>> sqlContext = SQLContext(sc)
>>> >>>> df = sqlContext.read.json("/root/Downloads/data/json/world_bank.j
>>> son")
>>> > 16/12/31 22:54:53 INFO json.JSONRelation: Listing
>>> > hdfs://localhost:9000/root/Downloads/data/json/world_bank.json on
>>> driver
>>> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0 stored as
>>> > values in memory (estimated size 212.4 KB, free 212.4 KB)
>>> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0_piece0
>>> stored
>>> > as bytes in memory (estimated size 19.6 KB, free 232.0 KB)
>>> > 16/12/31 22:54:54 INFO storage.BlockManagerInfo: Added
>>> broadcast_0_piece0 in
>>> > memory on localhost:39844 (size: 19.6 KB, free: 511.1 MB)
>>> > 16/12/31 22:54:54 INFO spark.SparkContext: Created broadcast 0 from
>>> json at
>>> > NativeMethodAccessorImpl.java:-2
>>> > Traceback (most recent call last):
>>> >   File "", line 1, in 
>>> >   File "/opt/spark/python/pyspark/sql/readwriter.py", line 176, in
>>> json
>>> > return self._df(self._jreader.json(path))
>>> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>>> line
>>> > 813, in __call__
>>> >   File "/opt/spark/python/pyspark/sql/utils.py", line 45, in deco
>>> > return f(*a, **kw)
>>> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py",
>>> line 308,
>>> > in get_return_value
>>> > py4j.protocol.Py4JJavaError: An error occurred while calling o19.json.
>>> > : java.io.IOException: No input paths specified in job
>>> > at
>>> > org.apache.hadoop.mapred.FileInputFormat.listStatus(Fi

Re: Error when loading json to spark

2017-01-01 Thread Raymond Xie
I found the cause:

I need to "put" the json file onto hdfs first before it can be used, here
is what I did:

hdfs dfs -put  /root/Downloads/data/json/world_bank.json
hdfs://localhost:9000/json
df = sqlContext.read.json("/json/")
df.show(10)

.

However, there is a new problem: the JSON data needs to be tweaked a bit
before it can really be used. Simply using df =
sqlContext.read.json("/json/") makes the df messy; I need the df to know
the fields in the json file.

How?

Thank you.




*----*
*Sincerely yours,*


*Raymond*

On Sat, Dec 31, 2016 at 11:52 PM, Miguel Morales <therevolti...@gmail.com>
wrote:

> Looks like it's trying to treat that path as a folder, try omitting
> the file name and just use the folder path.
>
> On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie <xie3208...@gmail.com> wrote:
> > Happy new year!!!
> >
> > I am trying to load a json file into spark, the json file is attached
> here.
> >
> > I received the following error, can anyone help me to fix it? Thank you
> very
> > much. I am using Spark 1.6.2 and python 2.7.5
> >
> >>>> from pyspark.sql import SQLContext
> >>>> sqlContext = SQLContext(sc)
> >>>> df = sqlContext.read.json("/root/Downloads/data/json/world_
> bank.json")
> > 16/12/31 22:54:53 INFO json.JSONRelation: Listing
> > hdfs://localhost:9000/root/Downloads/data/json/world_bank.json on driver
> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0 stored as
> > values in memory (estimated size 212.4 KB, free 212.4 KB)
> > 16/12/31 22:54:54 INFO storage.MemoryStore: Block broadcast_0_piece0
> stored
> > as bytes in memory (estimated size 19.6 KB, free 232.0 KB)
> > 16/12/31 22:54:54 INFO storage.BlockManagerInfo: Added
> broadcast_0_piece0 in
> > memory on localhost:39844 (size: 19.6 KB, free: 511.1 MB)
> > 16/12/31 22:54:54 INFO spark.SparkContext: Created broadcast 0 from json
> at
> > NativeMethodAccessorImpl.java:-2
> > Traceback (most recent call last):
> >   File "", line 1, in 
> >   File "/opt/spark/python/pyspark/sql/readwriter.py", line 176, in json
> > return self._df(self._jreader.json(path))
> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
> line
> > 813, in __call__
> >   File "/opt/spark/python/pyspark/sql/utils.py", line 45, in deco
> > return f(*a, **kw)
> >   File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line
> 308,
> > in get_return_value
> > py4j.protocol.Py4JJavaError: An error occurred while calling o19.json.
> > : java.io.IOException: No input paths specified in job
> > at
> > org.apache.hadoop.mapred.FileInputFormat.listStatus(
> FileInputFormat.java:201)
> > at
> > org.apache.hadoop.mapred.FileInputFormat.getSplits(
> FileInputFormat.java:313)
> > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at
> > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
> MapPartitionsRDD.scala:35)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at
> > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
> MapPartitionsRDD.scala:35)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> > at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(
> RDD.scala:1129)
> > at
> > org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:150)
> > at
> > org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:111)
> > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
> > at
> > org.apache.spark.sql.execution.datasources.json.InferSchema$.infer(
> InferSchema.scala:65)
> > at
> > org.apache.spark.sql.execution.datasources.json.
> JSONRelation$$anonfun$4.apply(JSONRelation.scala:114)
> &

Re: Error when loading json to spark

2017-01-01 Thread Raymond Xie
Thank you Miguel, here is the output:

>>>df = sqlContext.read.json("/root/Downloads/data/json")
17/01/01 07:28:19 INFO json.JSONRelation: Listing
hdfs://localhost:9000/root/Downloads/data/json on driver
17/01/01 07:28:19 INFO storage.MemoryStore: Block broadcast_2 stored as
values in memory (estimated size 212.4 KB, free 444.5 KB)
17/01/01 07:28:19 INFO storage.MemoryStore: Block broadcast_2_piece0 stored
as bytes in memory (estimated size 19.6 KB, free 464.1 KB)
17/01/01 07:28:19 INFO storage.BlockManagerInfo: Added broadcast_2_piece0
in memory on localhost:39844 (size: 19.6 KB, free: 511.1 MB)
17/01/01 07:28:19 INFO spark.SparkContext: Created broadcast 2 from json at
NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/spark/python/pyspark/sql/readwriter.py", line 176, in json
return self._df(self._jreader.json(path))
  File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
813, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308,
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o115.json.
: java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1129)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
at
org.apache.spark.sql.execution.datasources.json.InferSchema$.infer(InferSchema.scala:65)
at
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:114)
at
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109)
at
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108)
at
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
at
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
at
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

>>>



*----*
*Sincerely yours,*


*Raymond*

On Sat, Dec 31, 2016 at 11:52 PM, Miguel Morales <therevolti...@gmail.com>
wrote:

> Looks like it's trying to treat that path as a folder, try omitting
> the file name and just use the folder path.
>
> On Sat, Dec 31, 2016 at 7:58 PM, Raymond Xie <xie3208...@gmail.com> wrote:
> > Happy new year!!!
> >
> > I

From Hive to Spark, what is the default database/table

2016-12-31 Thread Raymond Xie
Hello,


It is indicated in
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#dataframes
 when Running SQL Queries Programmatically you can do:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT * FROM table")

However, it does not indicate what should be put there as "table". For
example, in my case I have a couple of data warehouses and tables, and one of
them is:
hdfs dfs -ls hdfs://localhost:9000/user/hive/warehouse/flight201601

if I use:
>>> df = sqlContext.sql("SELECT * FROM flight201601")

it will prompt:
pyspark.sql.utils.AnalysisException: u'Table not found: flight201601;'



How do I write the sql query if I want to select from flight201601?

Thank you.
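
A plain SQLContext only sees tables registered in the current session; tables
that live in the Hive warehouse are reached through HiveContext in Spark 1.6.
A minimal sketch, assuming hive-site.xml is on the classpath so Spark can
locate the metastore:

from pyspark.sql import HiveContext

# HiveContext reads hive-site.xml to find the Hive metastore and warehouse.
sqlContext = HiveContext(sc)

df = sqlContext.sql("SELECT * FROM flight201601")
df.show(10)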



*----*
*Sincerely yours,*


*Raymond*


Re: How to load a big csv to dataframe in Spark 1.6

2016-12-31 Thread Raymond Xie
Hello Felix,

I followed the instruction and ran the command:

> $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0

and I received the following error message:
java.lang.RuntimeException: java.net.ConnectException: Call From xie1/
192.168.112.150 to localhost:9000 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused

any thought?



**
*Sincerely yours,*


*Raymond*

On Fri, Dec 30, 2016 at 10:08 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Have you tried the spark-csv package?
>
> https://spark-packages.org/package/databricks/spark-csv
>
>
> ------
> *From:* Raymond Xie <xie3208...@gmail.com>
> *Sent:* Friday, December 30, 2016 6:46:11 PM
> *To:* user@spark.apache.org
> *Subject:* How to load a big csv to dataframe in Spark 1.6
>
> Hello,
>
> I see there is usually this way to load a csv to dataframe:
>
> sqlContext = SQLContext(sc)
>
> Employee_rdd = sc.textFile("\..\Employee.csv")
>.map(lambda line: line.split(","))
>
> Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
>
> Employee_df.show()
>
> However in my case my csv has 100+ fields, which means toDF() will be very
> lengthy.
>
> Can anyone tell me a practical method to load the data?
>
> Thank you very much.
>
>
> *Raymond*
>
>


Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Raymond Xie
Thanks Felix, I will try it tomorrow

~~~sent from my cell phone, sorry if there is any typo

On 30 December 2016 at 22:08, "Felix Cheung" <felixcheun...@hotmail.com> wrote:

> Have you tried the spark-csv package?
>
> https://spark-packages.org/package/databricks/spark-csv
>
>
> ------
> *From:* Raymond Xie <xie3208...@gmail.com>
> *Sent:* Friday, December 30, 2016 6:46:11 PM
> *To:* user@spark.apache.org
> *Subject:* How to load a big csv to dataframe in Spark 1.6
>
> Hello,
>
> I see there is usually this way to load a csv to dataframe:
>
> sqlContext = SQLContext(sc)
>
> Employee_rdd = sc.textFile("\..\Employee.csv")
>.map(lambda line: line.split(","))
>
> Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
>
> Employee_df.show()
>
> However in my case my csv has 100+ fields, which means toDF() will be very
> lengthy.
>
> Can anyone tell me a practical method to load the data?
>
> Thank you very much.
>
>
> *Raymond*
>
>


Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Raymond Xie
yes, I believe there should be a better way to handle my case.

~~~sent from my cell phone, sorry if there is any typo

On 30 December 2016 at 22:09, "write2sivakumar@gmail" <write2sivaku...@gmail.com> wrote:

Hi Raymond,

Your problem is to pass those 100 fields to .toDF() method??



Sent from my Samsung device


 Original message ----
From: Raymond Xie <xie3208...@gmail.com>
Date: 31/12/2016 10:46 (GMT+08:00)
To: user@spark.apache.org
Subject: How to load a big csv to dataframe in Spark 1.6

Hello,

I see there is usually this way to load a csv to dataframe:

sqlContext = SQLContext(sc)

Employee_rdd = sc.textFile("\..\Employee.csv")
   .map(lambda line: line.split(","))

Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])

Employee_df.show()

However in my case my csv has 100+ fields, which means toDF() will be very
lengthy.

Can anyone tell me a practical method to load the data?

Thank you very much.


*Raymond*


How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Raymond Xie
Hello,

I see there is usually this way to load a csv to dataframe:

sqlContext = SQLContext(sc)

Employee_rdd = sc.textFile("\..\Employee.csv")
   .map(lambda line: line.split(","))

Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])

Employee_df.show()

However in my case my csv has 100+ fields, which means toDF() will be very
lengthy.

Can anyone tell me a practical method to load the data?

Thank you very much.
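
A minimal sketch of the spark-csv package route mentioned in the replies above,
assuming Spark 1.6 and a shell started with the com.databricks:spark-csv
--packages coordinates; the path is a placeholder. With header and inferSchema
enabled there is no need to hand-write 100+ column names:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        # first line supplies the column names
      .option("inferSchema", "true")   # sample the file to guess column types
      .load("hdfs:///path/to/Employee.csv"))  # placeholder path

df.printSchema()
df.show(5)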


*Raymond*


What's the best practice to load data from RDBMS to Spark

2016-12-30 Thread Raymond Xie
Hello,

I am new to Spark. As a SQL developer, I have only taken some courses online and
spent some time on my own; I have never had a chance to work on a real project.

I wonder what would be the best practice (tool, procedure, ...) to load data
(CSV, Excel) into the Spark platform?

Thank you.
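
If the source is actually a relational database, as the subject suggests, a
minimal hedged sketch of the built-in JDBC reader (Spark 1.4+); the URL, table
name, and credentials are placeholders, and the matching JDBC driver jar must
be supplied, e.g. via --jars:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")  # placeholder URL
      .option("dbtable", "employees")                  # table name or subquery
      .option("user", "spark")                         # placeholder credentials
      .option("password", "secret")
      .load())

df.show(5)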



*Raymond*


Re: Does SparkSql thrift server support insert/update/delete sql statement

2016-03-28 Thread Raymond Honderdors
It should,
depending on the storage used.

I am facing a similar issue running Spark on EMR:

I got EMR login errors for INSERT.

Sent from Outlook Mobile



On Mon, Mar 28, 2016 at 10:31 PM -0700, "sage" 
> wrote:

Does SparkSql thrift server support insert/update/delete sql statement when
connecting using jdbc?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-SparkSql-thrift-server-support-insert-update-delete-sql-statement-tp26618.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Does SparkSql has official jdbc/odbc driver?

2016-03-27 Thread Raymond Honderdors
For now they are for free




Sent from my Samsung Galaxy smartphone.


 Original message 
From: Sage Meng <lkke...@gmail.com>
Date: 3/28/2016 04:14 (GMT+02:00)
To: Raymond Honderdors <raymond.honderd...@sizmek.com>
Cc: mich.talebza...@gmail.com, user@spark.apache.org
Subject: Re: Does SparkSql has official jdbc/odbc driver?

Hi Raymond Honderdors,
   are the ODBC/JDBC drivers for Spark SQL from Databricks free, or can the
Databricks drivers only be used with Databricks's Spark SQL release?

2016-03-25 17:48 GMT+08:00 Raymond Honderdors 
<raymond.honderd...@sizmek.com<mailto:raymond.honderd...@sizmek.com>>:

The recommended drivers for Spark / Thrift are the ones from Databricks (Simba).

My experience is that the Databricks driver works perfectly on Windows and Linux.
On Windows you can get the Microsoft driver.
Both are ODBC.

I have not yet tried the JDBC drivers.

Sent from Outlook Mobile<https://aka.ms/blhgte>



On Fri, Mar 25, 2016 at 1:23 AM -0700, "Mich Talebzadeh" 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:

JDBC drivers are specific to the databases you are accessing. They are produced
by the database vendors; for example, the Oracle one is called ojdbc6.jar and the
Sybase one is called jconn4.jar. Hive has got its own drivers.

There are companies that produce JDBC or ODBC drivers for various databases,
like Progress Direct.

Spark is a query tool, not a database, so it does not have its own drivers.

HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>



On 25 March 2016 at 06:33, sage <lkke...@gmail.com<mailto:lkke...@gmail.com>> 
wrote:
Hi all,
Does SparkSql have an official jdbc/odbc driver? I have only seen third-party
jdbc/odbc drivers.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-SparkSql-has-official-jdbc-odbc-driver-tp26591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>
For additional commands, e-mail: 
user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>





Re: Does SparkSql has official jdbc/odbc driver?

2016-03-25 Thread Raymond Honderdors
The recommended drivers for Spark / Thrift are the ones from Databricks (Simba).

My experience is that the Databricks driver works perfectly on Windows and Linux.
On Windows you can get the Microsoft driver.
Both are ODBC.

I have not yet tried the JDBC drivers.

Sent from Outlook Mobile



On Fri, Mar 25, 2016 at 1:23 AM -0700, "Mich Talebzadeh" 
> wrote:

JDBC drivers are specific to the databases you are accessing. They are produced
by the database vendors; for example, the Oracle one is called ojdbc6.jar and the
Sybase one is called jconn4.jar. Hive has got its own drivers.

There are companies that produce JDBC or ODBC drivers for various databases,
like Progress Direct.

Spark is a query tool, not a database, so it does not have its own drivers.

HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 25 March 2016 at 06:33, sage > 
wrote:
Hi all,
Does SparkSql have an official jdbc/odbc driver? I have only seen third-party
jdbc/odbc drivers.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-SparkSql-has-official-jdbc-odbc-driver-tp26591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org




Re: Spark with Druid

2016-03-23 Thread Raymond Honderdors
I saw these, but I fail to understand how to direct the code to use the index
JSON.

Sent from Outlook Mobile<https://aka.ms/blhgte>



On Wed, Mar 23, 2016 at 7:19 AM -0700, "Ted Yu" 
<yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>> wrote:

Please see:
https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani

which references https://github.com/SparklineData/spark-druid-olap

On Wed, Mar 23, 2016 at 5:59 AM, Raymond Honderdors 
<raymond.honderd...@sizmek.com<mailto:raymond.honderd...@sizmek.com>> wrote:
Does anyone have a good overview on how to integrate Spark and Druid?

I am now struggling with the creation of a druid data source in spark.


Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com<mailto:raymond.honderd...@sizmek.com>
T +972.7325.3569
Herzliya





Spark with Druid

2016-03-23 Thread Raymond Honderdors
Does anyone have a good overview on how to integrate Spark and Druid?

I am now struggling with the creation of a druid data source in spark.


Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com<mailto:raymond.honderd...@sizmek.com>
T +972.7325.3569
Herzliya




SparkSQL 2.0 snapshot - thrift server behavior

2016-03-21 Thread Raymond Honderdors
Hi,

We were running with spark 1.6.x and using the "SHOW TABLES IN 'default'" 
command to read the list of tables. I have noticed that when I run the same on 
version 2.0.0 I get an empty result, but when I run "SHOW TABLES" I get the 
result I am after.

Can we get the support back for the "SHOW TABLES IN 'default'"?


Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com<mailto:raymond.honderd...@sizmek.com>
T +972.7325.3569
Herzliya




Re: Why mapred for the HadoopRDD?

2014-11-04 Thread raymond
You could take a look at sc.newAPIHadoopRDD()


On 5 November 2014, at 09:29, Corey Nolet cjno...@gmail.com wrote:

 I'm fairly new to spark and I'm trying to kick the tires with a few 
 InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf 
 instead of a MapReduce Job object. Is there future planned support for the 
 mapreduce packaging?
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkSQL , best way to divide data into partitions?

2014-10-22 Thread raymond
Hi

I have a JSON file that can be loaded by sqlContext.jsonFile into a
table, but this table is not partitioned.

I then wish to transform this table into a partitioned table, say on the
field "date". What would be the best approach to do this? In Hive this is
usually done by loading data into a dedicated partition directly, but I don't
want to have to select the data out for each specific partition value and then
insert it partition by partition. How can I do this quickly, and how do I do
it in Spark SQL?
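
For readers on later Spark releases (1.4 and up), the DataFrame writer can
produce a partitioned layout directly; a minimal hedged sketch in PySpark,
with paths and the partition column as placeholders:

# Read the unpartitioned JSON table, then write it back partitioned by "date".
df = sqlContext.read.json("/path/to/input.json")      # placeholder input path

(df.write
   .partitionBy("date")      # one sub-directory per distinct date value
   .mode("overwrite")
   .parquet("/path/to/partitioned_table"))            # placeholder output path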

raymond
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: All executors run on just a few nodes

2014-10-20 Thread raymond
When the data's source host is not one of the registered executors, the task will
also be marked as PROCESS_LOCAL, though it really should have a different name for
this case. I don't know whether someone changed this name very recently, but for
0.9 that is the case.

When I say "satisfy", yes, I mean the executors have enough resources to run your
tasks. In your case all the tasks will be assigned out to the executors on the 4
registered nodes, even though the other 16 executors will probably register a
short while later; by the time they register, all tasks have already been
assigned. I am not sure this is exactly what happened in your cluster, but it
seems likely to me.

Your case might not be that executors do not run on all Spark nodes, but that
they do not get registered quickly enough.
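
A hedged sketch of the knobs later releases added for exactly this "wait for
executors before scheduling" problem (they are not in 0.9, and older 1.x
releases spelled the names with "Executors" instead of "Resources"):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("hbase-scan")  # placeholder app name
        # Do not schedule the first stage until ~90% of the expected
        # executor resources have registered with the driver.
        .set("spark.scheduler.minRegisteredResourcesRatio", "0.9")
        # Give up waiting after 60s (value is in milliseconds in 1.x).
        .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60000"))

sc = SparkContext(conf=conf)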



On 20 October 2014, at 14:15, Tao Xiao xiaotao.cs@gmail.com wrote:

 Raymond,
 
 Thank you.
 
 But I read from other thread that PROCESS_LOCAL means the data is in the 
 same JVM as the code that is running. When data is in the same JVM with the 
 code that is running, the data should be on the same node as JVM, i.e., the 
 data can be said to be local. 
 
 Also you said that the tasks will be assigned to available executors which 
 satisfy the application's requirements. But what requirements must an 
 executor satisfy so that task can be assigned to it? Do you mean resources 
 (memory, CPU)? 
 
 Finally, is there any way to guarantee that all executors for an application 
 will be run on all Spark nodes when data to be processed is big enough (for 
 example, HBase data resides on all RegionServers) ? 
  
 
 
 
 2014-10-20 11:35 GMT+08:00 raymond rgbbones.m...@gmail.com:
 My best guess is that the speed at which your executors get registered with the
 driver differs between runs.
 
 When you run it for the first time, the executors are not fully registered
 when the task set manager starts to assign tasks, and thus the tasks are assigned
 to the already-available executors, which already satisfy what you need, say 86
 tasks in one batch.
 
 And "PROCESS_LOCAL" does not necessarily mean that the data is local; it
 could be that the executor for the data source is not available yet (in your
 case, though, it will probably become available later).
 
 If this is the case, you could just sleep a few seconds before running the job,
 or there are some related patches providing another way to sync executor
 status before running applications, but I haven't tracked their status
 for a while.
 
 
 Raymond
 
 On 20 October 2014, at 11:22, Tao Xiao xiaotao.cs@gmail.com wrote:
 
 Hi all, 
 
 I have a Spark-0.9 cluster, which has 16 nodes.
 
 I wrote a Spark application to read data from an HBase table, which has 86 
 regions spreading over 20 RegionServers.
 
 I submitted the Spark app in Spark standalone mode and found that there were 
 86 executors running on just 3 nodes and it took about  30 minutes to read 
 data from the table. In this case, I noticed from Spark master UI that 
 Locality Level of all executors are PROCESS_LOCAL. 
 
 Later I ran the same app again (without any code changed) and found that 
 those 86 executors were running on 16 nodes, and this time it took just 4 
 minutes to read date from the same HBase table. In this case, I noticed that 
 Locality Level of most executors are NODE_LOCAL. 
 
 After testing multiple times, I found the two cases above occur randomly. 
 
 So I have 2 questions: 
 1)  Why would the two cases above occur randomly when I submitted the same 
 application multiple times ?
 2)  Would the spread of executors influence locality level ?
 
 Thank you.
 
   
 
 



RE: RDDs

2014-09-04 Thread Liu, Raymond
Actually, a replicated RDD and a parallel job on the same RDD are two unrelated
concepts.
A replicated RDD just stores its data on multiple nodes; it helps with HA and
provides a better chance of data locality. It is still one RDD, not two separate RDDs.
As for running two jobs on the same RDD, it doesn't matter whether the RDD is
replicated or not; you can always do it if you wish to.


Best Regards,
Raymond Liu

-Original Message-
From: Kartheek.R [mailto:kartheek.m...@gmail.com] 
Sent: Thursday, September 04, 2014 1:24 PM
To: u...@spark.incubator.apache.org
Subject: RE: RDDs

Thank you Raymond and Tobias. 
Yeah, I am very clear about what I was asking. I was talking about replicated 
rdd only. Now that I've got my understanding about job and application 
validated, I wanted to know if we can replicate an rdd and run two jobs (that 
need same rdd) of an application in parallel?.

-Karthk




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p13416.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: memory size for caching RDD

2014-09-04 Thread Liu, Raymond
You don't need to. The memory is not statically allocated to the RDD cache; the
setting is just an upper limit.
If the RDD cache does not use up the memory, it is always available for other
usage, except for the parts also controlled by other memoryFraction settings, e.g.
spark.shuffle.memoryFraction, which likewise sets an upper limit.
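
A minimal sketch of where those two upper limits are set (Spark 1.x property
names; the values shown are the usual defaults):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        # Upper bound on the fraction of executor heap usable for cached RDDs;
        # unused cache memory remains available for other work.
        .set("spark.storage.memoryFraction", "0.6")
        # Upper bound on the fraction usable for shuffle aggregation buffers.
        .set("spark.shuffle.memoryFraction", "0.2"))

sc = SparkContext(conf=conf)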

Best Regards,
Raymond Liu

From: 牛兆捷 [mailto:nzjem...@gmail.com]
Sent: Thursday, September 04, 2014 2:27 PM
To: Patrick Wendell
Cc: user@spark.apache.org; d...@spark.apache.org
Subject: Re: memory size for caching RDD

But is it possible to make t resizable? When we don't have many RDD to cache, 
we can give some memory to others.

2014-09-04 13:45 GMT+08:00 Patrick Wendell 
pwend...@gmail.commailto:pwend...@gmail.com:
Changing this is not supported, it si immutable similar to other spark
configuration settings.

On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 
nzjem...@gmail.commailto:nzjem...@gmail.com wrote:
 Dear all:

 Spark uses memory to cache RDD and the memory size is specified by
 spark.storage.memoryFraction.

 One the Executor starts, does Spark support adjusting/resizing memory size
 of this part dynamically?

 Thanks.

 --
 *Regards,*
 *Zhaojie*



--
Regards,
Zhaojie



RE: memory size for caching RDD

2014-09-04 Thread Liu, Raymond
I think there is no public API available to do this. In this case, the best you
can do might be to unpersist some RDDs manually. The problem is that this is done
per RDD, not per block. Also, if the storage level includes a disk level, the
data on disk will be removed too.
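
A one-line sketch of that manual route in PySpark; some_rdd is a placeholder:

# Explicitly drop the cached blocks of one RDD when the memory is needed elsewhere.
some_rdd.unpersist()
# If the RDD is used again later, it is recomputed from its lineage.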

Best Regards,
Raymond Liu

From: 牛兆捷 [mailto:nzjem...@gmail.com] 
Sent: Thursday, September 04, 2014 2:57 PM
To: Liu, Raymond
Cc: Patrick Wendell; user@spark.apache.org; d...@spark.apache.org
Subject: Re: memory size for caching RDD

Oh I see. 

I want to implement something like this: sometimes I need to release some
memory for other usage even when it is occupied by some RDDs (which can be
recomputed from their lineage when needed). Does Spark provide
interfaces to force it to release some memory?

2014-09-04 14:32 GMT+08:00 Liu, Raymond raymond@intel.com:
You don't need to. The memory is not statically allocated to the RDD cache; the
setting is just an upper limit.
If the RDD cache does not use up the memory, it is always available for other
usage, except for the parts also controlled by other memoryFraction settings, e.g.
spark.shuffle.memoryFraction, which likewise sets an upper limit.
 
Best Regards,
Raymond Liu
 
From: 牛兆捷 [mailto:nzjem...@gmail.com] 
Sent: Thursday, September 04, 2014 2:27 PM
To: Patrick Wendell
Cc: user@spark.apache.org; d...@spark.apache.org
Subject: Re: memory size for caching RDD
 
But is it possible to make t resizable? When we don't have many RDD to cache, 
we can give some memory to others.
 
2014-09-04 13:45 GMT+08:00 Patrick Wendell pwend...@gmail.com:
Changing this is not supported, it si immutable similar to other spark
configuration settings.

On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 nzjem...@gmail.com wrote:
 Dear all:

 Spark uses memory to cache RDD and the memory size is specified by
 spark.storage.memoryFraction.

 One the Executor starts, does Spark support adjusting/resizing memory size
 of this part dynamically?

 Thanks.

 --
 *Regards,*
 *Zhaojie*



-- 
Regards,
Zhaojie
 



-- 
Regards,
Zhaojie



RE: RDDs

2014-09-03 Thread Liu, Raymond
Not sure what you were referring to when saying "replicated RDD". If you actually mean
RDD, then yes, read the API doc and paper as Tobias mentioned.
If you actually focus on the word "replicated", then that is for fault
tolerance, and it is probably mostly used in the streaming case for receiver-created
RDDs.

For Spark, an application is your user program, and a job is an internal scheduling
concept: a group of RDD operations. Your application might invoke
several jobs.


Best Regards,
Raymond Liu

From: rapelly kartheek [mailto:kartheek.m...@gmail.com] 
Sent: Wednesday, September 03, 2014 5:03 PM
To: user@spark.apache.org
Subject: RDDs

Hi,
Can someone tell me what kind of operations can be performed on a replicated
RDD? What are the use cases of a replicated RDD?
One basic doubt that has been bothering me for a long time: what is the difference
between an application and a job in Spark parlance? I am confused because of the
Hadoop jargon.
Thank you


RE: resize memory size for caching RDD

2014-09-03 Thread Liu, Raymond
AFAIK, No.

Best Regards,
Raymond Liu

From: 牛兆捷 [mailto:nzjem...@gmail.com] 
Sent: Thursday, September 04, 2014 11:30 AM
To: user@spark.apache.org
Subject: resize memory size for caching RDD

Dear all:

Spark uses memory to cache RDD and the memory size is specified by 
spark.storage.memoryFraction.

One the Executor starts, does Spark support adjusting/resizing memory size of 
this part dynamically?

Thanks.

-- 
Regards,
Zhaojie


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: how to filter value in spark

2014-08-31 Thread Liu, Raymond
You could use cogroup to combine RDDs in one RDD for cross reference processing.

e.g.

a.cogroup(b).filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
  .map { case (k, (l, r)) => (k, l) }

Best Regards,
Raymond Liu

-Original Message-
From: marylucy [mailto:qaz163wsx_...@hotmail.com] 
Sent: Friday, August 29, 2014 9:26 PM
To: Matthew Farrellee
Cc: user@spark.apache.org
Subject: Re: how to filter value in spark

I see it works well, thank you!!!

But in the following situation, how do I do it?

var a = sc.textFile("/sparktest/1/").map((_, "a"))
var b = sc.textFile("/sparktest/2/").map((_, "b"))
How do I get (3, "a") and (4, "a")?


On Aug 28, 2014, at 19:54, Matthew Farrellee m...@redhat.com wrote:

 On 08/28/2014 07:20 AM, marylucy wrote:
 fileA=1 2 3 4  one number a line,save in /sparktest/1/
 fileB=3 4 5 6  one number a line,save in /sparktest/2/ I want to get 
 3 and 4
 
 var a = sc.textFile("/sparktest/1/").map((_, 1))
 var b = sc.textFile("/sparktest/2/").map((_, 1))
 
 a.filter(param => { b.lookup(param._1).length > 0 }).map(_._1).foreach(println)
 
 Error throw
 Scala.MatchError:Null
 PairRDDFunctions.lookup...
 
 the issue is nesting of the b rdd inside a transformation of the a rdd
 
 consider using intersection, it's more idiomatic
 
 a.intersection(b).foreach(println)
 
 but note that intersection will remove duplicates
 
 best,
 
 
 matt
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
 additional commands, e-mail: user-h...@spark.apache.org
 


RE: What is a Block Manager?

2014-08-27 Thread Liu, Raymond
The framework has that info to manage cluster status, and the info (e.g.
worker count) is also available through the Spark metrics system.
From the user application's point of view, though, can you give an example of why
you need this info and what you plan to do with it?

Best Regards,
Raymond Liu

From: Victor Tso-Guillen [mailto:v...@paxata.com] 
Sent: Wednesday, August 27, 2014 1:40 PM
To: Liu, Raymond
Cc: user@spark.apache.org
Subject: Re: What is a Block Manager?

We're a single-app deployment so we want to launch as many executors as the 
system has workers. We accomplish this by not configuring the max for the 
application. However, is there really no way to inspect what machines/executor 
ids/number of workers/etc is available in context? I'd imagine that there'd be 
something in the SparkContext or in the listener, but all I see in the listener 
is block managers getting added and removed. Wouldn't one care about the 
workers getting added and removed at least as much as for block managers?

On Tue, Aug 26, 2014 at 6:58 PM, Liu, Raymond raymond@intel.com wrote:
Basically, a Block Manager manages the storage for most of the data in Spark;
to name a few: blocks that represent cached RDD partitions, intermediate shuffle
data, broadcast data, etc. It is per executor, and in standalone mode you
normally have one executor per worker.

You don't control how many workers you have at runtime, but you can somehow
manage how many executors your application will launch; check each running
mode's documentation for details. (But control where they run? Hardly; YARN mode does
some work based on data locality, but this is done by the framework, not the user
program.)

Best Regards,
Raymond Liu

From: Victor Tso-Guillen [mailto:v...@paxata.com]
Sent: Tuesday, August 26, 2014 11:42 PM
To: user@spark.apache.org
Subject: What is a Block Manager?

I'm curious not only about what they do, but what their relationship is to the 
rest of the system. I find that I get listener events for n block managers 
added where n is also the number of workers I have available to the 
application. Is this a stable constant?

Also, are there ways to determine at runtime how many workers I have and where 
they are?

Thanks,
Victor



RE: What is a Block Manager?

2014-08-26 Thread Liu, Raymond
Basically, a Block Manager manages the storage for most of the data in Spark;
to name a few: blocks that represent cached RDD partitions, intermediate shuffle
data, broadcast data, etc. It is per executor, and in standalone mode you
normally have one executor per worker.

You don't control how many workers you have at runtime, but you can somehow
manage how many executors your application will launch; check each running
mode's documentation for details. (But control where they run? Hardly; YARN mode does
some work based on data locality, but this is done by the framework, not the user
program.)

Best Regards,
Raymond Liu

From: Victor Tso-Guillen [mailto:v...@paxata.com] 
Sent: Tuesday, August 26, 2014 11:42 PM
To: user@spark.apache.org
Subject: What is a Block Manager?

I'm curious not only about what they do, but what their relationship is to the 
rest of the system. I find that I get listener events for n block managers 
added where n is also the number of workers I have available to the 
application. Is this a stable constant?

Also, are there ways to determine at runtime how many workers I have and where 
they are?

Thanks,
Victor

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Request for help in writing to Textfile

2014-08-25 Thread Liu, Raymond
You can try to manipulate the string you want to output before saveAsTextFile,
something like

modify.flatMap(x => x).map { x =>
  val s = x.toString
  s.subSequence(1, s.length - 1)
}

There should be a more optimized way.

Best Regards,
Raymond Liu


-Original Message-
From: yh18190 [mailto:yh18...@gmail.com] 
Sent: Monday, August 25, 2014 9:57 PM
To: u...@spark.incubator.apache.org
Subject: Request for help in writing to Textfile

Hi Guys,

I am currently playing with huge data. I have an RDD which returns
RDD[List[(tuples)]]. I need only the tuples to be written to text file output
using the saveAsTextFile function.
Example: val mod = modify.saveAsTextFile() returns

List((20140813,4,141127,3,HYPHLJLU,HY,KNGHWEB,USD,144.00,662.40,KY1),
(20140813,4,141127,3,HYPHLJLU,HY,DBLHWEB,USD,144.00,662.40,KY1))
List((20140813,4,141127,3,HYPHLJLU,HY,KNGHWEB,USD,144.00,662.40,KY1),
(20140813,4,141127,3,HYPHLJLU,HY,DBLHWEB,USD,144.00,662.40,KY1)

I need following output with only tuple values in a textfile.
20140813,4,141127,3,HYPHLJLU,HY,KNGHWEB,USD,144.00,662.40,KY1
20140813,4,141127,3,HYPHLJLU,HY,DBLHWEB,USD,144.00,662.40,KY1


Please let me know if anybody has anyidea regarding this without using
collect() function...Please help me



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Request-for-help-in-writing-to-Textfile-tp12744.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: About StorageLevel

2014-06-26 Thread Liu, Raymond
I think there is a shuffle stage involved, and the later count jobs will
depend on the first job's shuffle stage output data directly, as long as it
is still available. Thus they will be much faster.
Best Regards,
Raymond Liu

From: tomsheep...@gmail.com [mailto:tomsheep...@gmail.com]
Sent: Friday, June 27, 2014 10:08 AM
To: user
Subject: Re: About StorageLevel

Thank you Andrew, that's very helpful.
I still have some doubts about a simple trial: I opened a spark shell in local
mode,
and typed in

val r = sc.parallelize(0 to 50)
val r2 = r.keyBy(x => x).groupByKey(10)

and then I invoked the count action several times on it,

r2.count
(multiple times)

The first job obviously takes more time than the latter ones. Is there some 
magic underneath?

Regards,
Kang Liu

From: Andrew Ormailto:and...@databricks.com
Date: 2014-06-27 02:25
To: usermailto:user@spark.apache.org
Subject: Re: About StorageLevel
Hi Kang,

You raise a good point. Spark does not automatically cache all your RDDs. Why? 
Simply because the application may create many RDDs, and not all of them are to 
be reused. After all, there is only so much memory available to each executor, 
and caching an RDD adds some overhead especially if we have to kick out old 
blocks with LRU. As an example, say you run the following chain:

sc.textFile(...).map(...).filter(...).flatMap(...).map(...).reduceByKey(...).count()

You might be interested in reusing only the final result, but each step of the 
chain actually creates an RDD. If we automatically cache all RDDs, then we'll 
end up doing extra work for the RDDs we don't care about. The effect can be 
much worse if our RDDs are big and there are many of them, in which case there 
may be a lot of churn in the cache as we constantly evict RDDs we reuse. After 
all, the users know best what RDDs they are most interested in, so it makes 
sense to give them control over caching behavior.

Best,
Andrew


2014-06-26 5:36 GMT-07:00 tomsheep...@gmail.commailto:tomsheep...@gmail.com 
tomsheep...@gmail.commailto:tomsheep...@gmail.com:
Hi all,

I have a newbie question about StorageLevel of spark. I came up with these 
sentences in spark documents:


If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), 
leave them that way. This is the most CPU-efficient option, allowing operations 
on the RDDs to run as fast as possible.


And


Spark automatically monitors cache usage on each node and drops out old data 
partitions in a least-recently-used (LRU) fashion. If you would like to 
manually remove an RDD instead of waiting for it to fall out of the cache, use 
the RDD.unpersist() method.
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

But I found the default storageLevel is NONE in source code, and if I never 
call 'persist(somelevel)', that value will always be NONE. The 'iterator' 
method goes to

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
if (storageLevel != StorageLevel.NONE) {
SparkEnv.get.cacheManager.getOrCompute(this, split, context, 
storageLevel)
} else {
computeOrReadCheckpoint(split, context)
}
}
Is that to say, the rdds are cached in memory (or somewhere else) if and only 
if the 'persist' or 'cache' method is called explicitly,
otherwise they will be re-computed every time even in an iterative situation?
It confused me because I had the first impression that Spark is super-fast
because it prefers to store intermediate results in memory automatically.


Forgive me if I asked a stupid question.


Regards,
Kang Liu



RE: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Liu, Raymond
If a task has no locality preference, it will also show up as
PROCESS_LOCAL, though I think we probably need to name it NO_PREFER to make it
clearer. Not sure whether this is your case.

Best Regards,
Raymond Liu

From: coded...@gmail.com [mailto:coded...@gmail.com] On Behalf Of Sung Hwan 
Chung
Sent: Friday, June 06, 2014 6:53 AM
To: user@spark.apache.org
Subject: Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or 
RACK_LOCAL?

Additionally, I've encountered some confusing situation where the locality 
level for a task showed up as 'PROCESS_LOCAL' even though I didn't cache the 
data. I wonder some implicit caching happens even without the user specifying 
things.

On Thu, Jun 5, 2014 at 3:50 PM, Sung Hwan Chung 
coded...@cs.stanford.edumailto:coded...@cs.stanford.edu wrote:
Thanks Andrew,

Is there a chance that even with full-caching, that modes other than 
PROCESS_LOCAL will be used? E.g., let's say, an executor will try to perform 
tasks although the data are cached on a different executor.

What I'd like to do is to prevent such a scenario entirely.

I'd like to know if setting 'spark.locality.wait' to a very high value would 
guarantee that the mode will always be 'PROCESS_LOCAL'.

On Thu, Jun 5, 2014 at 3:36 PM, Andrew Ash 
and...@andrewash.commailto:and...@andrewash.com wrote:
The locality is how close the data is to the code that's processing it.  
PROCESS_LOCAL means data is in the same JVM as the code that's running, so it's 
really fast.  NODE_LOCAL might mean that the data is in HDFS on the same node, 
or in another executor on the same node, so is a little slower because the data 
has to travel across an IPC connection.  RACK_LOCAL is even slower -- data is 
on a different server so needs to be sent over the network.

Spark switches to lower locality levels when there's no unprocessed data on a 
node that has idle CPUs.  In that situation you have two options: wait until 
the busy CPUs free up so you can start another task that uses data on that 
server, or start a new task on a farther away server that needs to bring data 
from that remote place.  What Spark typically does is wait a bit in the hopes 
that a busy CPU frees up.  Once that timeout expires, it starts moving the data 
from far away to the free CPU.

The main tunable option is how long the scheduler waits before starting to
move data rather than code.  Those are the spark.locality.* settings here:
http://spark.apache.org/docs/latest/configuration.html

If you want to prevent this from happening entirely, you can set the values to 
ridiculously high numbers.  The documentation also mentions that 0 has 
special meaning, so you can try that as well.

Good luck!
Andrew
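
A hedged sketch of those spark.locality.* knobs (Spark 1.x values are in
milliseconds; the default for spark.locality.wait is 3000, and the per-level
settings fall back to it when unset):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        # Wait longer for a free slot at the current locality level before
        # falling back to the next one (PROCESS_LOCAL -> NODE_LOCAL ->
        # RACK_LOCAL -> ANY).
        .set("spark.locality.wait", "30000")
        # Per-level overrides, if finer control is needed:
        .set("spark.locality.wait.process", "30000")
        .set("spark.locality.wait.node", "30000")
        .set("spark.locality.wait.rack", "30000"))

sc = SparkContext(conf=conf)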

On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung 
coded...@cs.stanford.edumailto:coded...@cs.stanford.edu wrote:
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume that 
this means fully cached) to NODE_LOCAL or even RACK_LOCAL.

When these happen things get extremely slow.

Does this mean that the executor got terminated and restarted?

Is there a way to prevent this from happening (barring the machine actually 
going down, I'd rather stick with the same process)?





RE: yarn-client mode question

2014-05-21 Thread Liu, Raymond
It seems you are asking whether the Spark-related jars need to be deployed to the YARN
cluster manually before you launch the application?
Then no, you don't, just like any other YARN application. And it doesn't matter
whether it is yarn-client or yarn-cluster mode.


Best Regards,
Raymond Liu

-Original Message-
From: Sophia [mailto:sln-1...@163.com] 
Sent: Thursday, May 22, 2014 10:55 AM
To: u...@spark.incubator.apache.org
Subject: Re: yarn-client mode question

But I don't understand this point: is it necessary to deploy Spark slave nodes
on the YARN nodes?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: different in spark on yarn mode and standalone mode

2014-05-04 Thread Liu, Raymond
At the core, they are not that different.
In standalone mode, the Spark master and Spark workers allocate the driver
and executors for your Spark app,
while in YARN mode the YARN resource manager and node managers do this work.
Once the driver and executors have been launched, the rest of the resource
scheduling goes through the same process, i.e. between the driver and executors via
Akka actors.

Best Regards,
Raymond Liu


-Original Message-
From: Sophia [mailto:sln-1...@163.com] 

Hey you guys,
What is the difference between Spark on YARN mode and standalone mode with regard to
resource scheduling?
Wishing you happiness every day.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/different-in-spark-on-yarn-mode-and-standalone-mode-tp5300.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: How fast would you expect shuffle serialize to be?

2014-04-30 Thread Liu, Raymond
I just tried to use the serializer to write objects directly in local mode with this code:

val datasize = args(1).toInt
val dataset = (0 until datasize).map( i => ("asmallstring", i))

val out: OutputStream = {
  new BufferedOutputStream(new FileOutputStream(args(2)), 1024 * 100)
}

val ser = SparkEnv.get.serializer.newInstance()
val serOut = ser.serializeStream(out)

dataset.foreach( value =>
  serOut.writeObject(value)
)
serOut.flush()
serOut.close()

Thus this is one core writing to one disk. When using JavaSerializer, the throughput
is 10~12MB/s, and Kryo doubles it. So does it seem reasonable that, when running the
full-path code in my previous case, 32 cores show around 50MB/s total throughput?


Best Regards,
Raymond Liu


-Original Message-
From: Liu, Raymond [mailto:raymond@intel.com] 


Later case, total throughput aggregated from all cores.

Best Regards,
Raymond Liu


-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Wednesday, April 30, 2014 1:22 PM
To: user@spark.apache.org
Subject: Re: How fast would you expect shuffle serialize to be?

Hm - I'm still not sure if you mean
100MB/s for each task = 3200MB/s across all cores
-or-
3.1MB/s for each task = 100MB/s across all cores

If it's the second one, that's really slow and something is wrong. If it's the 
first one this in the range of what I'd expect, but I'm no expert.

On Tue, Apr 29, 2014 at 10:14 PM, Liu, Raymond raymond@intel.com wrote:
 For all the tasks, say 32 task on total

 Best Regards,
 Raymond Liu


 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]

 Is this the serialization throughput per task or the serialization throughput 
 for all the tasks?

 On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote:
 Hi

 I am running a WordCount program which counts words from HDFS,
 and I noticed that the serializer part of the code takes a lot of CPU
 time. On a 16-core/32-thread node, the total throughput is around
 50MB/s with JavaSerializer, and if I switch to KryoSerializer, it
 doubles to around 100-150MB/s. (I have 12 disks per node and the files
 are scattered across disks, so HDFS bandwidth is not a problem.)

 And I also notice that, in this case, the object to write is
 (String, Int); if I try a case with (Int, Int), the throughput will be
 a further 2-3x faster.

 So, in my WordCount case, the bottleneck is CPU (because,
 with shuffle compression on, the 150MB/s data bandwidth on the input side
 will usually lead to around 50MB/s of shuffle data).

 This serialization bandwidth looks somehow too low, so I am wondering, what
 bandwidth do you observe in your case? Does this throughput sound reasonable to you?
 If not, what might need to be examined in my case?



 Best Regards,
 Raymond Liu




RE: Shuffle Spill Issue

2014-04-29 Thread Liu, Raymond
Hi Daniel

Thanks for your reply. I think that for reduceByKey it will also do a
map-side combine, so the result is the same: for each partition,
one entry per distinct word. In my case with JavaSerializer, the 240MB dataset
yields around 70MB of shuffle data. Only the Shuffle Spill (Memory) is
abnormal, and it sounds to me like it should not trigger at all. And, by the way, this
behavior only occurs on the map output side; on the reduce / shuffle fetch side, this
strange behavior doesn't happen.

Best Regards,
Raymond Liu

From: Daniel Darabos [mailto:daniel.dara...@lynxanalytics.com] 

I have no idea why shuffle spill is so large. But this might make it smaller:

val addition = (a: Int, b: Int) => a + b
val wordsCount = wordsPair.combineByKey(identity, addition, addition)

This way only one entry per distinct word will end up in the shuffle for each 
partition, instead of one entry per word occurrence.

On Tue, Apr 29, 2014 at 7:48 AM, Liu, Raymond raymond@intel.com wrote:
Hi  Patrick

        I am just doing a simple word count; the data is generated by the 
Hadoop random text writer.

        This seems to me not quite related to compression. If I turn off 
compression on the shuffle (the flag is sketched just after the table below), 
the metrics look like the following for the smaller 240MB dataset.


Executor ID  Address      Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
10           sr437:48527  35 s       8            0             8                0.0 B         2.5 MB         2.2 GB                  1291.2 KB
12           sr437:46077  34 s       8            0             8                0.0 B         2.5 MB         1822.6 MB               1073.3 KB
13           sr434:37896  31 s       8            0             8                0.0 B         2.4 MB         1099.2 MB               621.2 KB
15           sr438:52819  31 s       8            0             8                0.0 B         2.5 MB         1898.8 MB               1072.6 KB
16           sr434:37103  32 s       8            0             8                0.0 B         2.4 MB         1638.0 MB               1044.6 KB
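
For reference, a minimal sketch of how the shuffle compression toggle mentioned 
above might be set for such a run (the property is the 1.x-era 
spark.shuffle.compress flag; the app name is only a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Disable compression of map output files, so Shuffle Write reflects raw
// serialized bytes rather than compressed bytes.
val conf = new SparkConf()
  .setAppName("wordcount-shuffle-spill-test")   // placeholder app name
  .set("spark.shuffle.compress", "false")
val sc = new SparkContext(conf)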


        And the program is pretty simple:

val files = sc.textFile(args(1))
val words = files.flatMap(_.split(" "))
val wordsPair = words.map(x => (x, 1))

val wordsCount = wordsPair.reduceByKey(_ + _)
val count = wordsCount.count()

println("Number of words = " + count)


Best Regards,
Raymond Liu

From: Patrick Wendell [mailto:pwend...@gmail.com]

Could you explain more about what your job is doing and what data types you 
are using? These numbers alone don't necessarily indicate something is wrong. 
The relationship between the in-memory and on-disk shuffle amounts is 
definitely a bit strange; the data gets compressed when written to disk, but 
unless you have a weird dataset (e.g. all zeros) I wouldn't expect it to 
compress _that_ much.

On Mon, Apr 28, 2014 at 1:18 AM, Liu, Raymond raymond@intel.com wrote:
Hi


        I am running a simple word count program on a Spark standalone cluster. 
The cluster is made up of 6 nodes; each runs 4 workers, and each worker owns 
10G of memory and 16 cores, so 96 cores and 240G of memory in total. (It was 
also previously configured as 1 worker with 40G of memory on each node.)

        I ran a very small data set (2.4GB on HDFS in total) to confirm the 
problem, as below:

        As you can see from the task metrics below, the shuffle spill metrics 
indicate that something is wrong.

Executor ID  Address      Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0            sr437:42139  29 s       4            0             4                0.0 B         4.3 MB         23.6 GB                 4.3 MB
1            sr433:46935  1.1 min    4            0             4                0.0 B         4.2 MB         19.0 GB                 3.4 MB
10           sr436:53277  26 s       4            0             4                0.0 B         4.3 MB         25.6 GB                 4.6 MB
11           sr437:58872  32 s       4            0             4                0.0 B         4.3 MB         25.0 GB                 4.4 MB
12           sr435:48358  27 s       4            0             4                0.0 B         4.3 MB         25.1 GB                 4.4 MB


You can see that the Shuffle Spill (Memory) is pretty high, almost 5000x the 
actual shuffle data and Shuffle Spill (Disk), and it also seems to me that the 
spill should not be triggered at all, since the memory is not used up.

To verify that, I further reduced the data size to 240MB in total.

And here is the result:


Executor ID  Address      Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0            sr437:50895  15 s       4            0             4                0.0 B         703.0 KB       80.0 MB                 43.2 KB
1            sr433:50207  17 s       4            0             4                0.0 B         704.7 KB       389.5 MB                90.2 KB
10           sr436:56352  16 s       4            0             4                0.0 B         700.9 KB       814.9 MB                181.6 KB
11           sr437:53099  15 s       4            0             4                0.0 B         689.7 KB       0.0 B                   0.0 B
12           sr435:48318  15 s       4            0             4                0.0 B         702.1 KB       427.4 MB                90.7 KB
13           sr433:59294  17 s       4            0             4                0.0 B         704.8 KB       779.9 MB                180.3 KB

How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
Hi

I am running a WordCount program which counts words from HDFS, and I 
noticed that the serializer part of the code takes a lot of CPU time. On a 
16-core/32-thread node, the total throughput is around 50MB/s with 
JavaSerializer, and if I switch to KryoSerializer it doubles to around 
100-150MB/s. (I have 12 disks per node and the files are scattered across 
the disks, so HDFS bandwidth is not the problem.)

And I also notice that, in this case, the object to write is (String, Int); 
if I try a case with (Int, Int), the throughput is a further 2-3x faster.

So, in my WordCount case, the bottleneck is CPU (because, with shuffle 
compression on, the 150MB/s of data bandwidth on the input side usually 
leads to around 50MB/s of shuffle data).

This serialization bandwidth looks somewhat too low, so I am wondering: what 
bandwidth do you observe in your case? Does this throughput sound reasonable 
to you? If not, is there anything that might need to be examined in my case?



Best Regards,
Raymond Liu




RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
For all the tasks, say 32 tasks in total.

Best Regards,
Raymond Liu


-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 

Is this the serialization throughput per task or the serialization throughput 
for all the tasks?

On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote:
 Hi

 I am running a WordCount program which counts words from HDFS, and I 
 noticed that the serializer part of the code takes a lot of CPU time. On a 
 16-core/32-thread node, the total throughput is around 50MB/s with 
 JavaSerializer, and if I switch to KryoSerializer it doubles to around 
 100-150MB/s. (I have 12 disks per node and the files are scattered across 
 the disks, so HDFS bandwidth is not the problem.)

 And I also notice that, in this case, the object to write is (String, Int); 
 if I try a case with (Int, Int), the throughput is a further 2-3x faster.

 So, in my WordCount case, the bottleneck is CPU (because, with shuffle 
 compression on, the 150MB/s of data bandwidth on the input side usually 
 leads to around 50MB/s of shuffle data).

 This serialization bandwidth looks somewhat too low, so I am wondering: what 
 bandwidth do you observe in your case? Does this throughput sound reasonable 
 to you? If not, is there anything that might need to be examined in my case?



 Best Regards,
 Raymond Liu




RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
By the way, to be clear, I run repartition first to make all of the data go 
through the shuffle, instead of running reduceByKey etc. directly (which 
reduces the amount of data that needs to be shuffled and serialized); that is, 
all of the ~50MB/s of data read from HDFS goes to the serializer. (In fact, I 
also tried generating the data directly in memory instead of reading it from 
HDFS, with similar throughput results.)
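
A minimal sketch of that setup (the input path and partition count are 
placeholders, and sc is assumed to be an existing SparkContext):

// Build raw (word, 1) pairs without any aggregation.
val pairs = sc.textFile("hdfs:///data/random-text")   // placeholder input path
  .flatMap(_.split(" "))
  .map(w => (w, 1))

// repartition() forces every record through shuffle write, so all of the input
// bandwidth hits the serializer; reduceByKey would combine map-side first and
// shuffle far less data.
pairs.repartition(32).count()                         // 32 = assumed task count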

Best Regards,
Raymond Liu


-Original Message-
From: Liu, Raymond [mailto:raymond@intel.com] 

For all the tasks, say 32 tasks in total.

Best Regards,
Raymond Liu


-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 

Is this the serialization throughput per task or the serialization throughput 
for all the tasks?

On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote:
 Hi

 I am running a WordCount program which counts words from HDFS, and I 
 noticed that the serializer part of the code takes a lot of CPU time. On a 
 16-core/32-thread node, the total throughput is around 50MB/s with 
 JavaSerializer, and if I switch to KryoSerializer it doubles to around 
 100-150MB/s. (I have 12 disks per node and the files are scattered across 
 the disks, so HDFS bandwidth is not the problem.)

 And I also notice that, in this case, the object to write is (String, Int); 
 if I try a case with (Int, Int), the throughput is a further 2-3x faster.

 So, in my WordCount case, the bottleneck is CPU (because, with shuffle 
 compression on, the 150MB/s of data bandwidth on the input side usually 
 leads to around 50MB/s of shuffle data).

 This serialization bandwidth looks somewhat too low, so I am wondering: what 
 bandwidth do you observe in your case? Does this throughput sound reasonable 
 to you? If not, is there anything that might need to be examined in my case?



 Best Regards,
 Raymond Liu




RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
The latter case: total throughput aggregated across all cores.

Best Regards,
Raymond Liu


-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Wednesday, April 30, 2014 1:22 PM
To: user@spark.apache.org
Subject: Re: How fast would you expect shuffle serialize to be?

Hm - I'm still not sure if you mean
100MB/s for each task = 3200MB/s across all cores
-or-
3.1MB/s for each task = 100MB/s across all cores

If it's the second one, that's really slow and something is wrong. If it's the 
first one, this is in the range of what I'd expect, but I'm no expert.

On Tue, Apr 29, 2014 at 10:14 PM, Liu, Raymond raymond@intel.com wrote:
 For all the tasks, say 32 tasks in total.

 Best Regards,
 Raymond Liu


 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]

 Is this the serialization throughput per task or the serialization throughput 
 for all the tasks?

 On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote:
 Hi

 I am running a WordCount program which counts words from HDFS, and I 
 noticed that the serializer part of the code takes a lot of CPU time. On a 
 16-core/32-thread node, the total throughput is around 50MB/s with 
 JavaSerializer, and if I switch to KryoSerializer it doubles to around 
 100-150MB/s. (I have 12 disks per node and the files are scattered across 
 the disks, so HDFS bandwidth is not the problem.)

 And I also notice that, in this case, the object to write is (String, Int); 
 if I try a case with (Int, Int), the throughput is a further 2-3x faster.

 So, in my WordCount case, the bottleneck is CPU (because, with shuffle 
 compression on, the 150MB/s of data bandwidth on the input side usually 
 leads to around 50MB/s of shuffle data).

 This serialization bandwidth looks somewhat too low, so I am wondering: what 
 bandwidth do you observe in your case? Does this throughput sound reasonable 
 to you? If not, is there anything that might need to be examined in my case?



 Best Regards,
 Raymond Liu




Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi


I am running a simple word count program on a Spark standalone cluster. The 
cluster is made up of 6 nodes; each runs 4 workers, and each worker owns 10G 
of memory and 16 cores, so 96 cores and 240G of memory in total. (It was also 
previously configured as 1 worker with 40G of memory on each node.)

I ran a very small data set (2.4GB on HDFS in total) to confirm the problem, 
as below:

As you can see from the task metrics below, the shuffle spill metrics indicate 
that something is wrong.

Executor ID  Address      Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0            sr437:42139  29 s       4            0             4                0.0 B         4.3 MB         23.6 GB                 4.3 MB
1            sr433:46935  1.1 min    4            0             4                0.0 B         4.2 MB         19.0 GB                 3.4 MB
10           sr436:53277  26 s       4            0             4                0.0 B         4.3 MB         25.6 GB                 4.6 MB
11           sr437:58872  32 s       4            0             4                0.0 B         4.3 MB         25.0 GB                 4.4 MB
12           sr435:48358  27 s       4            0             4                0.0 B         4.3 MB         25.1 GB                 4.4 MB


You can see that the Shuffle Spill (Memory) is pretty high, almost 5000x the 
actual shuffle data and Shuffle Spill (Disk), and it also seems to me that the 
spill should not be triggered at all, since the memory is not used up.

To verify that, I further reduced the data size to 240MB in total.

And here is the result:


Executor ID  Address      Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0            sr437:50895  15 s       4            0             4                0.0 B         703.0 KB       80.0 MB                 43.2 KB
1            sr433:50207  17 s       4            0             4                0.0 B         704.7 KB       389.5 MB                90.2 KB
10           sr436:56352  16 s       4            0             4                0.0 B         700.9 KB       814.9 MB                181.6 KB
11           sr437:53099  15 s       4            0             4                0.0 B         689.7 KB       0.0 B                   0.0 B
12           sr435:48318  15 s       4            0             4                0.0 B         702.1 KB       427.4 MB                90.7 KB
13           sr433:59294  17 s       4            0             4                0.0 B         704.8 KB       779.9 MB                180.3 KB

Nothing prevented the spill from happening.

Now, it seems to me that there must be something wrong with the spill-trigger 
code.

Has anyone else encountered this issue? By the way, I am using the latest 
trunk code.
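
For anyone trying to reproduce this, a minimal sketch of the spill-related 
settings worth double-checking first (property names as I recall from the 
1.0-era configuration docs, so treat them as assumptions; the values shown are 
only illustrative, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// spark.shuffle.spill toggles map-side spilling entirely, and
// spark.shuffle.memoryFraction bounds how much of the heap the shuffle's
// in-memory maps may use before spilling. Both values below are illustrative.
val conf = new SparkConf()
  .setAppName("shuffle-spill-repro")            // placeholder app name
  .set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.3")
val sc = new SparkContext(conf)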


Best Regards,
Raymond Liu


RE: Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi  Patrick

I am just doing a simple word count; the data is generated by the Hadoop 
random text writer.

This seems to me not quite related to compression. If I turn off compression 
on the shuffle, the metrics look like the following for the smaller 240MB 
dataset.


Executor ID  Address      Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
10           sr437:48527  35 s       8            0             8                0.0 B         2.5 MB         2.2 GB                  1291.2 KB
12           sr437:46077  34 s       8            0             8                0.0 B         2.5 MB         1822.6 MB               1073.3 KB
13           sr434:37896  31 s       8            0             8                0.0 B         2.4 MB         1099.2 MB               621.2 KB
15           sr438:52819  31 s       8            0             8                0.0 B         2.5 MB         1898.8 MB               1072.6 KB
16           sr434:37103  32 s       8            0             8                0.0 B         2.4 MB         1638.0 MB               1044.6 KB


And the program is pretty simple:

val files = sc.textFile(args(1))
val words = files.flatMap(_.split(" "))
val wordsPair = words.map(x => (x, 1))

val wordsCount = wordsPair.reduceByKey(_ + _)
val count = wordsCount.count()

println("Number of words = " + count)


Best Regards,
Raymond Liu

From: Patrick Wendell [mailto:pwend...@gmail.com] 

Could you explain more about what your job is doing and what data types you 
are using? These numbers alone don't necessarily indicate something is wrong. 
The relationship between the in-memory and on-disk shuffle amounts is 
definitely a bit strange; the data gets compressed when written to disk, but 
unless you have a weird dataset (e.g. all zeros) I wouldn't expect it to 
compress _that_ much.

On Mon, Apr 28, 2014 at 1:18 AM, Liu, Raymond raymond@intel.com wrote:
Hi


        I am running a simple word count program on a Spark standalone cluster. 
The cluster is made up of 6 nodes; each runs 4 workers, and each worker owns 
10G of memory and 16 cores, so 96 cores and 240G of memory in total. (It was 
also previously configured as 1 worker with 40G of memory on each node.)

        I ran a very small data set (2.4GB on HDFS in total) to confirm the 
problem, as below:

        As you can see from the task metrics below, the shuffle spill metrics 
indicate that something is wrong.

Executor ID  Address      Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0            sr437:42139  29 s       4            0             4                0.0 B         4.3 MB         23.6 GB                 4.3 MB
1            sr433:46935  1.1 min    4            0             4                0.0 B         4.2 MB         19.0 GB                 3.4 MB
10           sr436:53277  26 s       4            0             4                0.0 B         4.3 MB         25.6 GB                 4.6 MB
11           sr437:58872  32 s       4            0             4                0.0 B         4.3 MB         25.0 GB                 4.4 MB
12           sr435:48358  27 s       4            0             4                0.0 B         4.3 MB         25.1 GB                 4.4 MB


You can see that the Shuffle Spill (Memory) is pretty high, almost 5000x the 
actual shuffle data and Shuffle Spill (Disk), and it also seems to me that the 
spill should not be triggered at all, since the memory is not used up.

To verify that, I further reduced the data size to 240MB in total.

And here is the result:


Executor ID  Address      Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0            sr437:50895  15 s       4            0             4                0.0 B         703.0 KB       80.0 MB                 43.2 KB
1            sr433:50207  17 s       4            0             4                0.0 B         704.7 KB       389.5 MB                90.2 KB
10           sr436:56352  16 s       4            0             4                0.0 B         700.9 KB       814.9 MB                181.6 KB
11           sr437:53099  15 s       4            0             4                0.0 B         689.7 KB       0.0 B                   0.0 B
12           sr435:48318  15 s       4            0             4                0.0 B         702.1 KB       427.4 MB                90.7 KB
13           sr433:59294  17 s       4            0             4                0.0 B         704.8 KB       779.9 MB                180.3 KB

Nothing prevented the spill from happening.

Now, it seems to me that there must be something wrong with the spill-trigger 
code.

Has anyone else encountered this issue? By the way, I am using the latest 
trunk code.


Best Regards,
Raymond Liu