[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but their capacity is small. HDDs have good 
capacity, but their performance is 2-3x lower than SSDs'.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks (for either shuffle or RDD cache), it allocates 
them on SSDs first; when the SSDs' usable space falls below a threshold, it 
falls back to HDDs.

In our implementation, we actually go further: we support building a hierarchy 
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all-SSD storage.
2. In the worst case, e.g. when all data spills to HDDs, there is no performance 
regression.
3. Compared with all HDDs, the hierarchy store improves performance by more than 
*_1.86x_* (it could be higher; the CPU is the bottleneck in our test 
environment).
4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, because 
we support both RDD cache and shuffle, with no extra inter-process 
communication.

*Usage*
1. In spark-defaults.conf, configure spark.hierarchyStore.
{code}
spark.hierarchyStore nvm 50GB, ssd 80GB
{code}
This builds a three-layer hierarchy store: the 1st layer is "nvm", the 2nd is 
"ssd", and all remaining directories form the last layer.

2. Configure the "nvm" and "ssd" locations in the local directories, i.e. 
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

Then restart your Spark application; it will allocate blocks from nvm first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.
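
To make the layer-selection policy concrete, below is a minimal, self-contained 
Scala sketch of the fallback logic described above (hypothetical names and 
structure only, not the actual patch):
{code}
import java.io.File

// One layer of the hierarchy: its local directories and the usable-space
// threshold (bytes) below which allocation falls through to the next layer.
// A threshold of 0 means "always accept" (the last layer).
case class Layer(name: String, dirs: Seq[File], reserveBytes: Long)

// Pick a directory for a new block: take the first layer whose total usable
// space is still above its reserve, otherwise fall through; the last layer
// always accepts. Directory choice inside a layer is simplified here.
def chooseDir(layers: Seq[Layer]): File = {
  val layer = layers
    .find(l => l.reserveBytes == 0L || l.dirs.map(_.getUsableSpace).sum > l.reserveBytes)
    .getOrElse(layers.last)
  layer.dirs.maxBy(_.getUsableSpace)
}

val GB = 1024L * 1024 * 1024
val layers = Seq(
  Layer("nvm", Seq(new File("/mnt/nvm1")), 50 * GB),
  Layer("ssd", (1 to 3).map(i => new File(s"/mnt/ssd$i")), 80 * GB),
  Layer("hdd", (1 to 4).map(i => new File(s"/mnt/disk$i")), 0L)
)
println(chooseDir(layers))
{code}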

  was:
*Problem*:
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but their capacity is small. HDDs have good 
capacity, but their performance is 2-3x lower than SSDs'.
How can we get the best of both?

*Solution*:
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks (for either shuffle or RDD cache), it allocates 
them on SSDs first; when the SSDs' usable space falls below a threshold, it 
falls back to HDDs.

In our implementation, we actually go further: we support building a hierarchy 
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*:
1. In the best case, our solution performs the same as all-SSD storage.
2. In the worst case, e.g. when all data spills to HDDs, there is no performance 
regression.
3. Compared with all HDDs, the hierarchy store improves performance by more than 
*_1.86x_* (it could be higher; the CPU is the bottleneck in our test 
environment).
4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, because 
we support both RDD cache and shuffle, with no extra inter-process 
communication.

*Usage*
1. In spark-defaults.conf, configure spark.hierarchyStore.
{code}
spark.hierarchyStore nvm 50GB, ssd 80GB
{code}
This builds a three-layer hierarchy store: the 1st layer is "nvm", the 2nd is 
"ssd", and all remaining directories form the last layer.

2. Configure the "nvm" and "ssd" locations in the local directories, i.e. 
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

Then restart your Spark application; it will allocate blocks from nvm first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.


> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but their capacity is small. HDDs have good 
> capacity, but their performance is 2-3x lower than SSDs'.
> How can we get the best of both?
> *Solution*
> Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks (for either shuffle or RDD cache), it 
> allocates them on SSDs first; when the SSDs' usable space falls below a 
> threshold, it falls back to HDDs.
> In our implementation, we actually go further: we support building a hierarchy 
> store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
> *Performance*
> 1. In the best case, our solution performs the same as all-SSD storage.
> 2. In the worst case, e.g. when all data spills to HDDs, there is no 
> performance regression.
> 3. Compared with 

[jira] [Commented] (SPARK-12231) Failed to generate predicate Error when using dropna

2015-12-09 Thread yahsuan, chang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048260#comment-15048260
 ] 

yahsuan, chang commented on SPARK-12231:


The following code doesn't throw an exception in my environment:

{code}
import pyspark
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.range(10)
df1 = df.withColumn('a', df['id'] * 2)
df1.write.parquet('./data')

df2 = sqlc.read.parquet('./data')
print df2.dropna().count()
{code}

I use spark-1.5.2-bin-hadoop2.6/bin/spark-submit to execute the code.

Thanks for your quick reply

> Failed to generate predicate Error when using dropna
> 
>
> Key: SPARK-12231
> URL: https://issues.apache.org/jira/browse/SPARK-12231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
> Environment: python version: 2.7.9
> os: ubuntu 14.04
>Reporter: yahsuan, chang
>
> code to reproduce error
> # write.py
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df = sqlc.range(10)
> df1 = df.withColumn('a', df['id'] * 2)
> df1.write.partitionBy('id').parquet('./data')
> # read.py
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df2 = sqlc.read.parquet('./data')
> df2.dropna().count()
> $ spark-submit write.py
> $ spark-submit read.py
> # error message
> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to 
> interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: a#0L
> ...
> If the data is written without partitionBy, the error does not happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API

2015-12-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048286#comment-15048286
 ] 

Reynold Xin commented on SPARK-12225:
-

Why is this necessary? It seems difficult to design an API for this, and it is 
trivial for users to do it themselves.


> Support adding or replacing multiple columns at once in DataFrame API
> -
>
> Key: SPARK-12225
> URL: https://issues.apache.org/jira/browse/SPARK-12225
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> Currently, the withColumn() method of DataFrame supports adding or replacing 
> only a single column. It would be convenient to support adding or replacing 
> multiple columns at once.
> Also, withColumnRenamed() supports renaming only a single column. It would 
> also be convenient to support renaming multiple columns at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API

2015-12-09 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048432#comment-15048432
 ] 

Sun Rui commented on SPARK-12225:
-

[~r...@databricks.com] 
This comes from supporting the mutate() function in SparkR. mutate() is a 
function in the popular dplyr data-manipulation package which allows adding and 
replacing multiple columns in a single call.

The implementation can't be as simple as calling withColumn() multiple times, 
because each call to withColumn() returns a new DataFrame, while the remaining 
columns to be added or replaced are Columns referring to the old DataFrame. So 
the implementation is not trivial; you can refer to 
https://github.com/apache/spark/pull/10220.

It would simplify SparkR's implementation if there were a multiple-column 
version of withColumn.

Another motivation for this JIRA is that we just added drop() for multiple 
columns. So from API parity's point of view, it would be good to have a 
multiple-column version of withColumn.
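
For illustration, a minimal Scala sketch (hypothetical helper, not the proposed 
API) of how a multi-column variant could be layered on the single-column 
withColumn(): it folds over (name, expression) pairs, so each expression is 
resolved against the intermediate DataFrame rather than against a stale 
reference to the original one.
{code}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper: add or replace several columns by folding the
// single-column withColumn() over (name, expression) pairs.
def withColumns(df: DataFrame, cols: Seq[(String, Column)]): DataFrame =
  cols.foldLeft(df) { case (acc, (name, expr)) => acc.withColumn(name, expr) }

// Usage sketch, assuming a DataFrame `df` with columns "id" and "price":
// val result = withColumns(df, Seq(
//   "id2"    -> (col("id") * 2),
//   "price2" -> (col("price") + 1)
// ))
{code}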


> Support adding or replacing multiple columns at once in DataFrame API
> -
>
> Key: SPARK-12225
> URL: https://issues.apache.org/jira/browse/SPARK-12225
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> Currently, the withColumn() method of DataFrame supports adding or replacing 
> only a single column. It would be convenient to support adding or replacing 
> multiple columns at once.
> Also, withColumnRenamed() supports renaming only a single column. It would 
> also be convenient to support renaming multiple columns at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.

2015-12-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon closed SPARK-11621.

   Resolution: Fixed
Fix Version/s: 1.6.0

> ORC filter pushdown not working properly after new unhandled filter interface.
> --
>
> Key: SPARK-11621
> URL: https://issues.apache.org/jira/browse/SPARK-11621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
> Fix For: 1.6.0
>
>
> After the new interface for dropping predicate-push-downed filters that are 
> already processed at the datasource level 
> (https://github.com/apache/spark/pull/9399), filters are no longer pushed down 
> for ORC.
> This is because at {{DataSourceStrategy}}, all the filters are treated as 
> unhandled filters.
> Also, since ORC cannot filter exactly record by record but only returns rough 
> (coarse-grained) results, the filters for ORC should not be reported as 
> unhandled filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8338) Ganglia fails to start

2015-12-09 Thread Michael Reilly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048294#comment-15048294
 ] 

Michael Reilly commented on SPARK-8338:
---

This issue appears to have resurfaced in 1.5.2.

I'm creating a cluster in ec2 with spark-1.5.2-bin-hadoop2.6.

I issue a command like this:

 ./spark-ec2 --key-pair=spark-sentiment --identity-file=/home/ec2-user/xxx.pem 
--spark-version=1.5.2 --region=eu-west-1 --zone=eu-west-1b --user-data=java8.sh 
launch spark-cluster-test --vpc-id=myvpc --subnet-id=mysubnet

(java8.sh just installs java8 on the cluster nodes)

After about ten minutes it finishes. The last few lines of output are:

Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
ec2-52-30-193-79.eu-west-1.compute.amazonaws.com
Shutting down GANGLIA gmond:   [FAILED]
Starting GANGLIA gmond:[  OK  ]
Shutting down GANGLIA gmond:   [FAILED]
Starting GANGLIA gmond:[  OK  ]
Connection to ec2-52-30-193-79.eu-west-1.compute.amazonaws.com closed.
Shutting down GANGLIA gmetad:  [FAILED]
Starting GANGLIA gmetad:   [  OK  ]
Stopping httpd:[FAILED]
Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: 
Cannot load /etc/httpd/modules/mod_authz_core.so into server: 
/etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such 
file or directory
   [FAILED]
[timing] ganglia setup:  00h 00m 02s
Connection to ec2-52-31-xxx-xxx.eu-west-1.compute.amazonaws.com closed.
Spark standalone cluster started at 
http://ec2-52-31-xxx-xxx.eu-west-1.compute.amazonaws.com:8080
Ganglia started at 
http://ec2-52-31-xxx-xxx.eu-west-1.compute.amazonaws.com:5080/ganglia
Done!


> Ganglia fails to start
> --
>
> Key: SPARK-8338
> URL: https://issues.apache.org/jira/browse/SPARK-8338
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Vladimir Vladimirov
>Assignee: Vladimir Vladimirov
>Priority: Minor
> Fix For: 1.5.0
>
>
> Exception
> {code}
> Starting httpd: httpd: Syntax error on line 154 of 
> /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so 
> into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object 
> file: No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*:
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but their capacity is small. HDDs have good 
capacity, but their performance is 2-3x lower than SSDs'.
How can we get the best of both?

*Solution*:
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks (for either shuffle or RDD cache), it allocates 
them on SSDs first; when the SSDs' usable space falls below a threshold, it 
falls back to HDDs.

In our implementation, we actually go further: we support building a hierarchy 
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*:
1. In the best case, our solution performs the same as all-SSD storage.
2. In the worst case, e.g. when all data spills to HDDs, there is no performance 
regression.
3. Compared with all HDDs, the hierarchy store improves performance by more than 
*_1.86x_* (it could be higher; the CPU is the bottleneck in our test 
environment).
4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, because 
we support both RDD cache and shuffle, with no extra inter-process 
communication.


  was:
*Problem*:
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but their capacity is low. HDDs have good 
capacity, but their performance is 2-3x lower than SSDs'.
How can we get the best of both?

*Solution*:
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks (for either shuffle or RDD cache), it allocates 
them on SSDs first; when the SSDs' usable space falls below a threshold, it 
falls back to HDDs.

In our implementation, we actually go further: we support building a hierarchy 
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*:
1. In the best case, our solution performs the same as all-SSD storage.
2. In the worst case, e.g. when all data spills to HDDs, there is no performance 
regression.
3. Compared with all HDDs, the hierarchy store improves performance by more than 
1.86x (it could be higher; the CPU is the bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store is still 1.3x faster, because we 
support both RDD cache and shuffle, with no extra inter-process communication.



> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*:
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but their capacity is small. HDDs have good 
> capacity, but their performance is 2-3x lower than SSDs'.
> How can we get the best of both?
> *Solution*:
> Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks (for either shuffle or RDD cache), it 
> allocates them on SSDs first; when the SSDs' usable space falls below a 
> threshold, it falls back to HDDs.
> In our implementation, we actually go further: we support building a hierarchy 
> store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
> *Performance*:
> 1. In the best case, our solution performs the same as all-SSD storage.
> 2. In the worst case, e.g. when all data spills to HDDs, there is no 
> performance regression.
> 3. Compared with all HDDs, the hierarchy store improves performance by more 
> than *_1.86x_* (it could be higher; the CPU is the bottleneck in our test 
> environment).
> 4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, 
> because we support both RDD cache and shuffle, with no extra inter-process 
> communication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12238) s/Advanced sources/External Sources in docs.

2015-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12238:


Assignee: (was: Apache Spark)

> s/Advanced sources/External Sources in docs.
> 
>
> Key: SPARK-12238
> URL: https://issues.apache.org/jira/browse/SPARK-12238
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Reporter: Prashant Sharma
>
> While reading the docs, I felt that "external sources" (instead of "advanced 
> sources") seemed more appropriate, as they belong outside the streaming core 
> project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*:
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but their capacity is small. HDDs have good 
capacity, but their performance is 2-3x lower than SSDs'.
How can we get the best of both?

*Solution*:
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks (for either shuffle or RDD cache), it allocates 
them on SSDs first; when the SSDs' usable space falls below a threshold, it 
falls back to HDDs.

In our implementation, we actually go further: we support building a hierarchy 
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*:
1. In the best case, our solution performs the same as all-SSD storage.
2. In the worst case, e.g. when all data spills to HDDs, there is no performance 
regression.
3. Compared with all HDDs, the hierarchy store improves performance by more than 
*_1.86x_* (it could be higher; the CPU is the bottleneck in our test 
environment).
4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, because 
we support both RDD cache and shuffle, with no extra inter-process 
communication.

*Usage*
1. In spark-defaults.conf, configure spark.hierarchyStore.
{code}
spark.hierarchyStore nvm 50GB, ssd 80GB
{code}
This builds a three-layer hierarchy store: the 1st layer is "nvm", the 2nd is 
"ssd", and all remaining directories form the last layer.

2. Configure the "nvm" and "ssd" locations in the local directories, i.e. 
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

Then restart your Spark application; it will allocate blocks from nvm first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.

  was:
*Problem*:
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but their capacity is small. HDDs have good 
capacity, but their performance is 2-3x lower than SSDs'.
How can we get the best of both?

*Solution*:
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks (for either shuffle or RDD cache), it allocates 
them on SSDs first; when the SSDs' usable space falls below a threshold, it 
falls back to HDDs.

In our implementation, we actually go further: we support building a hierarchy 
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*:
1. In the best case, our solution performs the same as all-SSD storage.
2. In the worst case, e.g. when all data spills to HDDs, there is no performance 
regression.
3. Compared with all HDDs, the hierarchy store improves performance by more than 
*_1.86x_* (it could be higher; the CPU is the bottleneck in our test 
environment).
4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, because 
we support both RDD cache and shuffle, with no extra inter-process 
communication.



> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*:
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but their capacity is small. HDDs have good 
> capacity, but their performance is 2-3x lower than SSDs'.
> How can we get the best of both?
> *Solution*:
> Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks (for either shuffle or RDD cache), it 
> allocates them on SSDs first; when the SSDs' usable space falls below a 
> threshold, it falls back to HDDs.
> In our implementation, we actually go further: we support building a hierarchy 
> store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
> *Performance*:
> 1. In the best case, our solution performs the same as all-SSD storage.
> 2. In the worst case, e.g. when all data spills to HDDs, there is no 
> performance regression.
> 3. Compared with all HDDs, the hierarchy store improves performance by more 
> than *_1.86x_* (it could be higher; the CPU is the bottleneck in our test 
> environment).
> 4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, 
> because we support both RDD cache and shuffle, with no extra inter-process 
> communication.
> *Usage*
> 1. In spark-defaults.conf, configure spark.hierarchyStore.
> {code}
> spark.hierarchyStore nvm 50GB, ssd 80GB
> {code}
> This builds a three-layer hierarchy store: the 1st layer is "nvm", the 2nd is 
> "ssd", and all remaining directories form the last layer.
> 2. Configure the "nvm" and "ssd" locations in the local directories, i.e. 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> 

[jira] [Commented] (SPARK-12238) s/Advanced sources/External Sources in docs.

2015-12-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048365#comment-15048365
 ] 

Apache Spark commented on SPARK-12238:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/10223

> s/Advanced sources/External Sources in docs.
> 
>
> Key: SPARK-12238
> URL: https://issues.apache.org/jira/browse/SPARK-12238
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Reporter: Prashant Sharma
>
> While reading the docs, I felt that "external sources" (instead of "advanced 
> sources") seemed more appropriate, as they belong outside the streaming core 
> project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12237) Unsupported message RpcMessage causes message retries

2015-12-09 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-12237:
---

 Summary: Unsupported message RpcMessage causes message retries
 Key: SPARK-12237
 URL: https://issues.apache.org/jira/browse/SPARK-12237
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Jacek Laskowski


When an unsupported message is sent to an endpoint, Spark throws 
{{org.apache.spark.SparkException}} and retries sending the message. It should 
*not* retry, since the message is unsupported and will never succeed.
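
A minimal Scala sketch of the idea (hypothetical names, not Spark's actual RPC 
code): a retry loop should give up immediately on failures that can never 
succeed, such as an unsupported message, instead of treating every exception as 
transient. The observed retries follow in the log below.
{code}
import scala.util.{Failure, Success, Try}

// Hypothetical marker for errors that retrying cannot fix.
class UnsupportedMessageException(msg: String) extends Exception(msg)

// Sketch of a retry helper that distinguishes permanent from transient failures.
def askWithRetrySketch[T](maxAttempts: Int)(send: () => T): T = {
  var attempt = 0
  while (true) {
    attempt += 1
    Try(send()) match {
      case Success(value) => return value
      case Failure(e: UnsupportedMessageException) =>
        throw e                                   // permanent: fail fast, no retry
      case Failure(e) if attempt < maxAttempts => // transient: retry
      case Failure(e) => throw e                  // out of attempts
    }
  }
  throw new IllegalStateException("unreachable")
}
{code}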

{code}
WARN NettyRpcEndpointRef: Error sending message [message = RetrieveSparkProps] 
in 1 attempts
org.apache.spark.SparkException: Unsupported message 
RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@c0a6275)
 from localhost:51137
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
at 
org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
WARN NettyRpcEndpointRef: Error sending message [message = RetrieveSparkProps] 
in 2 attempts
org.apache.spark.SparkException: Unsupported message 
RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@73a76a5a)
 from localhost:51137
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
at 
org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
WARN NettyRpcEndpointRef: Error sending message [message = RetrieveSparkProps] 
in 3 attempts
org.apache.spark.SparkException: Unsupported message 
RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@670bfda7)
 from localhost:51137
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
at 
org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:151)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:253)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Error sending message [message = 
RetrieveSparkProps]
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
at 

[jira] [Closed] (SPARK-12233) Cannot specify a data frame column during join

2015-12-09 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu closed SPARK-12233.
---
Resolution: Not A Problem

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> Red Font should be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at 

[jira] [Issue Comment Deleted] (SPARK-12233) Cannot specify a data frame column during join

2015-12-09 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated SPARK-12233:

Comment: was deleted

(was: I am using :
select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr"))

to hit the error.


But you are right. I cannot use the old DF to select the element after the join



should close now. Thanks)

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> Red Font should be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> 

[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs

2015-12-09 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048312#comment-15048312
 ] 

Alok Singh commented on SPARK-12172:


+1 Sun Rui

Having the RDD API in SparkR
1) lets people do low-level operations if they want;
2) lets users run custom mappers and reducers (for example, one may decide to 
write a custom ML routine in SparkR using DataFrames and RDDs, rather than 
modifying MLlib);
3) lets someone who wants to enhance SparkR for their internal usage or their 
customers, when it probably doesn't make sense for the SparkR community to add 
it, build a custom R package on top of the SparkR package.

just some thoughts..
thanks
Alok

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048283#comment-15048283
 ] 

Apache Spark commented on SPARK-11965:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10222

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Update the user guide for RFormula to cover feature interactions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11965:


Assignee: (was: Apache Spark)

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Update the user guide for RFormula to cover feature interactions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11965:


Assignee: Apache Spark

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Update the user guide for RFormula to cover feature interactions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12238) s/Advanced sources/External Sources in docs.

2015-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12238:


Assignee: Apache Spark

> s/Advanced sources/External Sources in docs.
> 
>
> Key: SPARK-12238
> URL: https://issues.apache.org/jira/browse/SPARK-12238
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Reporter: Prashant Sharma
>Assignee: Apache Spark
>
> While reading the docs, I felt that "external sources" (instead of "advanced 
> sources") seemed more appropriate, as they belong outside the streaming core 
> project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12239) ​SparkR - Not distributing SparkR module in YARN

2015-12-09 Thread Sebastian YEPES FERNANDEZ (JIRA)
Sebastian YEPES FERNANDEZ created SPARK-12239:
-

 Summary: ​SparkR - Not distributing SparkR module in YARN
 Key: SPARK-12239
 URL: https://issues.apache.org/jira/browse/SPARK-12239
 Project: Spark
  Issue Type: Bug
  Components: SparkR, YARN
Affects Versions: 1.5.2, 1.5.3
Reporter: Sebastian YEPES FERNANDEZ
Priority: Critical


Hello,

I am trying to use the SparkR in a YARN environment and I have encountered the 
following problem:

Everything works correctly when using bin/sparkR, but if I try running the same 
jobs by loading SparkR directly from R, it does not work.

I have managed to track down the cause: when SparkR is launched through R, the 
"SparkR" module is not distributed to the worker nodes.
I have tried to work around this with the "spark.yarn.dist.archives" setting, 
but it does not work because the file/extracted folder is deployed with the 
".zip" extension, while the workers are actually looking for a folder named 
"sparkr".


Is there currently any way to make this work?



{code}
# spark-defaults.conf
spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip

# R
library(SparkR, lib.loc="/opt/apps/spark/R/lib/")
sc <- sparkR.init(appName="SparkR", master="yarn-client", 
sparkEnvir=list(spark.executor.instances="1"))
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
head(df)

15/12/09 09:04:24 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
fr-s-cour-wrk3.alidaho.com): java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
{code}


Container stderr:
{code}
15/12/09 09:04:14 INFO storage.MemoryStore: Block broadcast_1 stored as values 
in memory (estimated size 8.7 KB, free 530.0 MB)
15/12/09 09:04:14 INFO r.BufferedStreamThread: Fatal error: cannot open file 
'/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02/sparkr/SparkR/worker/daemon.R':
 No such file or directory
15/12/09 09:04:24 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
{code}


Worker node that ran the container:
{code}
# ls -la 
/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02
total 71M
drwx--x--- 3 yarn hadoop 4.0K Dec  9 09:04 .
drwx--x--- 7 yarn hadoop 4.0K Dec  9 09:04 ..
-rw-r--r-- 1 yarn hadoop  110 Dec  9 09:03 container_tokens
-rw-r--r-- 1 yarn hadoop   12 Dec  9 09:03 .container_tokens.crc
-rwx-- 1 yarn hadoop  736 Dec  9 09:03 default_container_executor_session.sh
-rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 
.default_container_executor_session.sh.crc
-rwx-- 1 yarn hadoop  790 Dec  9 09:03 default_container_executor.sh
-rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 .default_container_executor.sh.crc
-rwxr-xr-x 1 yarn hadoop  61K Dec  9 09:04 hadoop-lzo-0.6.0.2.3.2.0-2950.jar
-rwxr-xr-x 1 yarn hadoop 317K Dec  9 09:04 kafka-clients-0.8.2.2.jar
-rwx-- 1 yarn hadoop 6.0K Dec  9 09:03 launch_container.sh
-rw-r--r-- 1 yarn hadoop   56 Dec  9 09:03 .launch_container.sh.crc
-rwxr-xr-x 1 yarn hadoop 2.2M Dec  9 09:04 
spark-cassandra-connector_2.10-1.5.0-M3.jar
-rwxr-xr-x 1 yarn hadoop 7.1M Dec  9 09:04 spark-csv-assembly-1.3.0.jar
lrwxrwxrwx 1 yarn hadoop  119 Dec  9 09:03 __spark__.jar -> 
/hadoop/hdfs/disk03/hadoop/yarn/local/usercache/spark/filecache/361/spark-assembly-1.5.3-SNAPSHOT-hadoop2.7.1.jar
lrwxrwxrwx 1 yarn hadoop   84 Dec  9 09:03 sparkr.zip -> 
/hadoop/hdfs/disk01/hadoop/yarn/local/usercache/spark/filecache/359/sparkr.zip
-rwxr-xr-x 1 yarn hadoop 1.8M Dec  9 09:04 
spark-streaming_2.10-1.5.3-SNAPSHOT.jar
-rwxr-xr-x 1 yarn hadoop  11M Dec  9 09:04 
spark-streaming-kafka-assembly_2.10-1.5.3-SNAPSHOT.jar
-rwxr-xr-x 1 yarn hadoop  48M Dec  9 09:04 
sparkts-0.1.0-SNAPSHOT-jar-with-dependencies.jar
drwx--x--- 2 yarn hadoop   46 Dec  9 09:04 tmp
{code}


*Working case:*
{code}
# sparkR --master yarn-client --num-executors 1

df <- createDataFrame(sqlContext, faithful)
head(df)
  eruptions waiting
1 3.600  79
2 1.800  54
3 3.333  74
4 2.283  62
5 4.533  85
6 2.883  55
{code}

Worker node that ran the container:
{code}
# ls -la 

[jira] [Commented] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048256#comment-15048256
 ] 

Xiao Li commented on SPARK-12218:
-

That is what I did in my environment. 

{code}
val df1 = Seq(1, 2, 3).map(i => (i, i, i.toString)).toDF("int1", "int2", 
"str")
df1.where("not (int1=0 and int2=0)").explain(true)
df1.where("not(int1=0) or not(int2=0)").explain(true)
{code}

Their physical plans are exactly the same. 

{code}
== Parsed Logical Plan ==
'Filter NOT (('int1 = 0) && ('int2 = 0))
+- Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Analyzed Logical Plan ==
int1: int, int2: int, str: string
Filter NOT ((int1#3 = 0) && (int2#4 = 0))
+- Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Optimized Logical Plan ==
Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
+- Filter (NOT (_1#0 = 0) || NOT (_2#1 = 0))
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Physical Plan ==
Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
+- Filter (NOT (_1#0 = 0) || NOT (_2#1 = 0))
   +- LocalTableScan [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]
{code}

{code}
== Parsed Logical Plan ==
'Filter (NOT ('int1 = 0) || NOT ('int2 = 0))
+- Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Analyzed Logical Plan ==
int1: int, int2: int, str: string
Filter (NOT (int1#3 = 0) || NOT (int2#4 = 0))
+- Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Optimized Logical Plan ==
Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
+- Filter (NOT (_1#0 = 0) || NOT (_2#1 = 0))
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Physical Plan ==
Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
+- Filter (NOT (_1#0 = 0) || NOT (_2#1 = 0))
   +- LocalTableScan [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]
{code}


> Boolean logic in sql does not work  "not (A and B)" is not the same as  "(not 
> A) or (not B)"
> 
>
> Key: SPARK-12218
> URL: https://issues.apache.org/jira/browse/SPARK-12218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Irakli Machabeli
>Priority: Blocker
>
> Two identical queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( 
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-09 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048262#comment-15048262
 ] 

Fengdong Yu commented on SPARK-12233:
-

I am using :
select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr"))

to hit the error.


But you are right. I cannot use the old DF to select the element after the join



should close now. Thanks

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> Red Font should be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> 

[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-09 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048264#comment-15048264
 ] 

Fengdong Yu commented on SPARK-12233:
-

I am using :
select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr"))

to hit the error.


But you are right. I cannot use the old DF to select the element after the join



should close now. Thanks

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> Red Font should be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-12233) Cannot specify a data frame column during join

2015-12-09 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated SPARK-12233:

Comment: was deleted

(was: I am using:
select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr"))

and that is what triggers the error.

But you are right: I cannot use the old DataFrame to select the column after the join.

This should be closed now. Thanks)

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> The red line should instead be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> 

[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-09 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048265#comment-15048265
 ] 

Fengdong Yu commented on SPARK-12233:
-

I am using:
select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr"))

and that is what triggers the error.

But you are right: I cannot use the old DataFrame to select the column after the join.

This should be closed now. Thanks.

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> The red line should instead be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-12233) Cannot specify a data frame column during join

2015-12-09 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated SPARK-12233:

Comment: was deleted

(was: I am using:
select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr"))

and that is what triggers the error.

But you are right: I cannot use the old DataFrame to select the column after the join.

This should be closed now. Thanks)

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> The red line should instead be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> 

[jira] [Commented] (SPARK-12232) Consider exporting read.table in R

2015-12-09 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048261#comment-15048261
 ] 

Sun Rui commented on SPARK-12232:
-

+1 [~yanboliang]

Since we already have table(), we can change it to an S4 generic function, turn 
the original table() function into an S4 method on the "jobj" class, and add a 
new table() S4 method on "DataFrame".

If backward compatibility is not a concern, we could consider renaming table() 
to something like tableToDF(). But since table() from the R base package does 
not seem to be that commonly used, I see no strong motivation for the rename.

> Consider exporting read.table in R
> --
>
> Key: SPARK-12232
> URL: https://issues.apache.org/jira/browse/SPARK-12232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Since we have read.df, read.json and read.parquet (some in pending PRs), as 
> well as table(), we should consider adding read.table() for consistency and 
> R-likeness.
> However, this conflicts with utils::read.table, which returns an R data.frame.
> It seems neither table() nor read.table() is desirable in this case.
> table: https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
> read.table: 
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-09 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048263#comment-15048263
 ] 

Fengdong Yu commented on SPARK-12233:
-

I am using:
select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr"))

and that is what triggers the error.

But you are right: I cannot use the old DataFrame to select the column after the join.

This should be closed now. Thanks.

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> The red line should instead be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> 

[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*:
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but capacity is low. HDDs have good capacity, but 
x2-x3 lower than SSDs.
How can we get both good?

*Solution*:
Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets 
blocks from SSDs first, and when SSD’s useable space is less than some 
threshold, getting blocks from HDDs.

In our implementation, we actually go further. We support a way to build any 
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

*Performance*:
1. At the best case, our solution performs the same as all SSDs.
2. At the worst case, like all data are spilled to HDDs, no performance 
regression.
3. Compared with all HDDs, hierarchy store improves more than x1.86 (it could 
be higher, CPU reaches bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store still x1.3 faster. Because we 
support both RDD cache and shuffle and no extra inter process communication.


  was:
Problem:
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but capacity is low. HDDs have good capacity, 
but x2-x3 lower than SSDs.
How can we get both good?

Solution:
Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
gets blocks from SSDs first, and when SSD’s useable space is less than some 
threshold, getting blocks from HDDs.

In our implementation, we actually go further. We support a way to build any 
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

Performance:
1. At the best case, our solution performs the same as all SSDs.
At the worst case, like all data are spilled to HDDs, no performance 
regression.
2. Compared with all HDDs, hierarchy store improves more than x1.86 (it 
could be higher, CPU reaches bottleneck in our test environment).
3. Compared with Tachyon, our hierarchy store still x1.3 faster. Because we 
support both RDD cache and shuffle and no extra inter process communication.



> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*:
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is low. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*:
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*:
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than x1.86 (it could 
> be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still x1.3 faster. Because we 
> support both RDD cache and shuffle and no extra inter process communication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12238) s/Advanced sources/External Sources in docs.

2015-12-09 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-12238:
---

 Summary: s/Advanced sources/External Sources in docs.
 Key: SPARK-12238
 URL: https://issues.apache.org/jira/browse/SPARK-12238
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Streaming
Reporter: Prashant Sharma


While reading the docs, I felt that "External sources" (instead of "Advanced 
sources") would be more appropriate, since these sources live outside the 
streaming core project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code

2015-12-09 Thread Tao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048496#comment-15048496
 ] 

Tao Li commented on SPARK-12179:


I tried to use Spark's internal Row_Number() UDF, but the problem still occurs.

select $DATE as date, 'main' as type, host, rfhost, rfpv
from (
  select Row_Number() over (partition by host order by host, rfpv desc) r,
         host, rfhost, rfpv
  from (
    select delhost(t0.host) as host, delhost(t0.rfhost) as rfhost,
           sum(t0.rfpv) as rfpv
    from (
      select h.host as host, i.rfhost as rfhost, i.rfpv as rfpv
      from (
        select parse_url(ur, 'HOST') as host, count(1) as pv
        from custom.web_sogourank_orc_zlib
        where logdate >= $starttime and logdate <= $endtime
        group by parse_url(ur, 'HOST') order by pv desc limit 1
      ) h left outer join (
        select parse_url(ur, 'HOST') as host, parse_url(rf, 'HOST') as rfhost,
               count(*) as rfpv
        from custom.web_sogourank_orc_zlib
        where logdate >= $starttime and logdate <= $endtime
        group by parse_url(ur, 'HOST'), parse_url(rf, 'HOST')
      ) i
      on h.host = i.host
    ) t0
    group by delhost(t0.host), delhost(t0.rfhost)
    distribute by host sort by host, rfpv desc
  ) t1
) t2 where r <= 10

> Spark SQL get different result with the same code
> -
>
> Key: SPARK-12179
> URL: https://issues.apache.org/jira/browse/SPARK-12179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 
> 1.5.2, 1.5.3
> Environment: hadoop version: 2.5.0-cdh5.3.2
> spark version: 1.5.3
> run mode: yarn-client
>Reporter: Tao Li
>Priority: Critical
>
> I run the SQL in yarn-client mode, but get a different result each time.
> As you can see in the example, I get different shuffle write sizes for the 
> same shuffle read in two jobs running the same code.
> Some of my Spark applications run well, but some always hit this problem. I 
> have seen it on Spark 1.3, 1.4 and 1.5.
> Can you give me some suggestions about the possible causes, or how I can 
> figure out the problem?
> 1. First Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.8 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54934
> 2. Second Run
> Details for Stage 9 (Attempt 0)
> Total Time Across All Tasks: 5.6 min
> Shuffle Read: 24.4 MB / 205399
> Shuffle Write: 6.8 MB / 54905



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Comment: was deleted

(was: Sorry, I have to delete this PR because of a mistaken GitHub operation on 
my side; I will send a new one ASAP.)

> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. In spark-default.xml, configure spark.hierarchyStore.
> {code}
> spark.hierarchyStore nvm 50GB, ssd 80GB
> {code}
> It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all 
> the rest form the last layer.
> 2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir 
> or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but capacity is small. HDDs have good capacity, 
but x2-x3 lower than SSDs.
How can we get both good?

*Solution*
Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets 
blocks from SSDs first, and when SSD’s useable space is less than some 
threshold, getting blocks from HDDs.

In our implementation, we actually go further. We support a way to build any 
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

*Performance*
1. At the best case, our solution performs the same as all SSDs.
2. At the worst case, like all data are spilled to HDDs, no performance 
regression.
3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
could be higher, CPU reaches bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because we 
support both RDD cache and shuffle and no extra inter process communication.

*Usage*
1. Configure spark.hierarchyStore.
{code}
spark.hierarchyStore='nvm 50GB,ssd 80GB'
{code}
It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all 
the rest form the last layer.

2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir 
or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After then, restart your Spark application, it will allocate blocks from nvm 
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.
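
For illustration only, here is a minimal Scala sketch of wiring up the 
configuration described above through SparkConf. Note that spark.hierarchyStore 
is the property proposed in this JIRA (it has no effect on stock Spark) and the 
paths are placeholders:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object HierarchyStoreExample {
  def main(args: Array[String]): Unit = {
    // Layers are tried in order: "nvm" first, then "ssd", then everything
    // else listed in spark.local.dir forms the last layer.
    val conf = new SparkConf()
      .setAppName("hierarchy-store-example")
      .set("spark.hierarchyStore", "nvm 50GB, ssd 80GB")
      .set("spark.local.dir",
        "/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3," +
        "/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others")
    val sc = new SparkContext(conf)
    sc.stop()
  }
}
{code}
In practice these properties would normally live in spark-defaults.conf or the 
equivalent yarn.nodemanager.local-dirs setting, as described above.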

  was:
*Problem*
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but capacity is small. HDDs have good capacity, 
but x2-x3 lower than SSDs.
How can we get both good?

*Solution*
Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets 
blocks from SSDs first, and when SSD’s useable space is less than some 
threshold, getting blocks from HDDs.

In our implementation, we actually go further. We support a way to build any 
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

*Performance*
1. At the best case, our solution performs the same as all SSDs.
2. At the worst case, like all data are spilled to HDDs, no performance 
regression.
3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
could be higher, CPU reaches bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because we 
support both RDD cache and shuffle and no extra inter process communication.

*Usage*
1. Configure spark.hierarchyStore.
{code}
-Dspark.hierarchyStore=nvm 50GB,ssd 80GB
{code}
It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all 
the rest form the last layer.

2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir 
or yarn.nodemanager.local-dirs.
{code}
-Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After then, restart your Spark application, it will allocate blocks from nvm 
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.


> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than 

[jira] [Commented] (SPARK-11630) ClosureCleaner incorrectly warns for class based closures

2015-12-09 Thread W.H. (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048518#comment-15048518
 ] 

W.H. commented on SPARK-11630:
--

+1

> ClosureCleaner incorrectly warns for class based closures
> -
>
> Key: SPARK-11630
> URL: https://issues.apache.org/jira/browse/SPARK-11630
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Frens Jan Rumph
>Priority: Trivial
>
> Spark's `ClosureCleaner` utility seems to check whether a function is an 
> anonymous function: [ClosureCleaner.scala on line 
> 49|https://github.com/apache/spark/blob/f85aa06464a10f5d1563302fd76465dded475a12/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L49]
>  If not, it warns the user.
> However, I'm using some class-based functions, something along the lines of:
> {code}
> trait FromUnreadRow[T] extends (UnreadRow => T) with Serializable
> object ToPlainRow extends FromUnreadRow[PlainRow] {
>   override def apply(row: UnreadRow): PlainRow = ???
> }
> {code}
> This works just fine, and I can't really see that the warning is actually 
> useful in this case. I appreciate the check for common mistakes, but here a 
> user might be alarmed unnecessarily.
> Is there anything that can be done about this? Anything I can do?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but capacity is small. HDDs have good capacity, 
but x2-x3 lower than SSDs.
How can we get both good?

*Solution*
Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets 
blocks from SSDs first, and when SSD’s useable space is less than some 
threshold, getting blocks from HDDs.

In our implementation, we actually go further. We support a way to build any 
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

*Performance*
1. At the best case, our solution performs the same as all SSDs.
2. At the worst case, like all data are spilled to HDDs, no performance 
regression.
3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
could be higher, CPU reaches bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because we 
support both RDD cache and shuffle and no extra inter process communication.

*Usage*
1. In spark-default.xml, configure spark.hierarchyStore.
{code}
-Dspark.hierarchyStore=nvm 50GB,ssd 80GB
{code}
It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all 
the rest form the last layer.

2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir 
or yarn.nodemanager.local-dirs.
{code}
-Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After then, restart your Spark application, it will allocate blocks from nvm 
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.

  was:
*Problem*
Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but capacity is small. HDDs have good capacity, 
but x2-x3 lower than SSDs.
How can we get both good?

*Solution*
Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
storage. 
When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets 
blocks from SSDs first, and when SSD’s useable space is less than some 
threshold, getting blocks from HDDs.

In our implementation, we actually go further. We support a way to build any 
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

*Performance*
1. At the best case, our solution performs the same as all SSDs.
2. At the worst case, like all data are spilled to HDDs, no performance 
regression.
3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
could be higher, CPU reaches bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because we 
support both RDD cache and shuffle and no extra inter process communication.

*Usage*
1. In spark-default.xml, configure spark.hierarchyStore.
{code}
spark.hierarchyStore nvm 50GB, ssd 80GB
{code}
It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all 
the rest form the last layer.

2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir 
or yarn.nodemanager.local-dirs.
{code}
-Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After then, restart your Spark application, it will allocate blocks from nvm 
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.


> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with 

[jira] [Comment Edited] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-12-09 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048636#comment-15048636
 ] 

Adam Roberts edited comment on SPARK-9858 at 12/9/15 1:07 PM:
--

Thanks for the prompt reply. rowBuffer is a variable in 
org.apache.spark.sql.execution.UnsafeRowSerializer, within the 
asKeyValueIterator method. I experimented with the Exchange class; the same 
problems are observed when using SparkSqlSerializer, which suggests the 
UnsafeRowSerializer is probably fine.

I agree with your second comment; I think the code within 
org.apache.spark.unsafe.Platform is OK, or we'd be hitting problems elsewhere.

It would be useful to determine how the values in the assertions can be derived 
programmatically. I think the partitioning algorithm itself is working as 
expected, but for some reason stages require more bytes on the platforms I'm 
using.

spark.sql.shuffle.partitions is unchanged; I'm working off the latest master 
code.

Is there something special about the aggregate, join, and complex query 2 tests?

Can we print exactly what the bytes are for each stage? I know rdd.count is 
always correct and the DataFrames are the same (printed each row, written to 
json and parquet - no concerns).

Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate 
test passes.

I'm wondering if there's an extra factor we should take into account when 
determining the indices regardless of platform.
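
As a side note, here is a minimal Scala sketch of pinning the shuffle-partition 
setting mentioned in the potential clue above while experimenting. The helper 
name is made up; "spark.sql.shuffle.partitions" is the public key behind 
SQLConf.SHUFFLE_PARTITIONS:
{code}
import org.apache.spark.sql.SQLContext

// Hypothetical helper: run a block with spark.sql.shuffle.partitions pinned
// to a fixed value, then restore the previous setting.
def withShufflePartitions(sqlContext: SQLContext, n: Int)(body: => Unit): Unit = {
  val key = "spark.sql.shuffle.partitions"
  val previous = sqlContext.getConf(key, "200")
  sqlContext.setConf(key, n.toString)
  try body finally sqlContext.setConf(key, previous)
}
{code}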


was (Author: aroberts):
Thanks for the prompt reply. rowBuffer is a variable in 
org.apache.spark.sql.execution.UnsafeRowSerializer, within the 
asKeyValueIterator method. I experimented with the Exchange class; the same 
problems are observed when using SparkSqlSerializer, which suggests the 
UnsafeRowSerializer is probably fine.

I agree with your second comment; I think the code within 
org.apache.spark.unsafe.Platform is OK, or we'd be hitting problems elsewhere.

It would be useful to determine how the values in the assertions can be derived 
programmatically. I think the partitioning algorithm itself is working as 
expected, but for some reason stages require more bytes on the platforms I'm 
using.

spark.sql.shuffle.partitions is unchanged; I'm working off the latest master 
code.

Is there something special about the aggregate, join, and complex query 2 tests?

Can we print exactly what the bytes are for each stage? I know rdd.count is 
always correct and the DataFrames are the same (printed each row, written to 
json and parquet - no concerns).

Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate 
test passes.

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12237) Unsupported message RpcMessage causes message retries

2015-12-09 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048651#comment-15048651
 ] 

Nan Zhu commented on SPARK-12237:
-

May I ask how you found this issue?

It seems that the Master received a RetrieveSparkProps message, which is 
supposed to be transmitted only between the Executor and the Driver.

> Unsupported message RpcMessage causes message retries
> -
>
> Key: SPARK-12237
> URL: https://issues.apache.org/jira/browse/SPARK-12237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> When an unsupported message is sent to an endpoint, Spark throws 
> {{org.apache.spark.SparkException}} and retries sending the message. It 
> should *not* retry, since the message is unsupported.
> {code}
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RetrieveSparkProps] in 1 attempts
> org.apache.spark.SparkException: Unsupported message 
> RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@c0a6275)
>  from localhost:51137
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RetrieveSparkProps] in 2 attempts
> org.apache.spark.SparkException: Unsupported message 
> RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@73a76a5a)
>  from localhost:51137
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RetrieveSparkProps] in 3 attempts
> org.apache.spark.SparkException: Unsupported message 
> RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@670bfda7)
>  from localhost:51137
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:151)
>   at 
> 

[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048574#comment-15048574
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a mistaken GitHub operation on my 
side; I will send a new one ASAP.

> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. In spark-default.xml, configure spark.hierarchyStore.
> {code}
> spark.hierarchyStore nvm 50GB, ssd 80GB
> {code}
> It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all 
> the rest form the last layer.
> 2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir 
> or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048579#comment-15048579
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a mistaken GitHub operation on my 
side; I will send a new one ASAP.

> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. In spark-default.xml, configure spark.hierarchyStore.
> {code}
> spark.hierarchyStore nvm 50GB, ssd 80GB
> {code}
> It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all 
> the rest form the last layer.
> 2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir 
> or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048572#comment-15048572
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a mistaken GitHub operation on my 
side; I will send a new one ASAP.




[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048576#comment-15048576
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a wrong GitHub operation on my side;
I will send a new one ASAP.




[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048571#comment-15048571
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a wrong GitHub operation on my side;
I will send a new one ASAP.




[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048577#comment-15048577
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a wrong GitHub operation on my side;
I will send a new one ASAP.




[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048570#comment-15048570
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a wrong GitHub operation on my side;
I will send a new one ASAP.




[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048575#comment-15048575
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a wrong GitHub operation on my side;
I will send a new one ASAP.




[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048573#comment-15048573
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a wrong GitHub operation on my side;
I will send a new one ASAP.




[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048578#comment-15048578
 ] 

yucai commented on SPARK-12196:
---

Sorry, I have to delete this PR because of a wrong GitHub operation on my side;
I will send a new one ASAP.




[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Comment: was deleted

(was: Sorry, I have to delete this PR because of a wrong GitHub operation on my
side; I will send a new one ASAP.)




[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Comment: was deleted

(was: Sorry, I have to delete this PR because of a wrong GitHub operation on my
side; I will send a new one ASAP.)




[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but their capacity is small. HDDs have good
capacity, but are x2-x3 slower than SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup
storage.
When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it
gets blocks from SSDs first; when the SSDs' usable space is less than some
threshold, it gets blocks from HDDs.

In our implementation, we actually go further: we support building a hierarchy
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, e.g. when all data spills to HDDs, there is no
performance regression.
3. Compared with all HDDs, the hierarchy store improves performance by more
than *_x1.86_* (it could be higher; the CPU reaches a bottleneck in our test
environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because
we support both RDD cache and shuffle, with no extra inter-process
communication.

*Usage*
1. Configure spark.hierarchyStore.
{code}
-Dspark.hierarchyStore=nvm 50GB,ssd 80GB
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is
"ssd", and all the rest form the last layer.

2. Configure the "nvm" and "ssd" locations in the local dirs, e.g.
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
-Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After that, restart your Spark application; it will allocate blocks from nvm
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last
layer.
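
As a toy illustration of the allocation policy above (illustrative only, not
the actual patch; the paths and thresholds are taken from the example
configuration, and Python's shutil is only used here to read free space):
{code}
import shutil

# Tiers in priority order: (mount point, bytes that must stay free before
# falling back to the next tier).
TIERS = [("/mnt/nvm1", 50 * 1024**3),   # nvm layer, 50GB threshold
         ("/mnt/ssd1", 80 * 1024**3)]   # ssd layer, 80GB threshold
LAST_LAYER = "/mnt/disk1"               # the rest of the local dirs

def pick_block_dir():
    """Return the first tier whose free space is still above its threshold."""
    for path, reserve in TIERS:
        if shutil.disk_usage(path).free > reserve:
            return path
    return LAST_LAYER
{code}
This only makes the fallback order (nvm, then ssd, then everything else)
concrete; it is not how the patch itself is implemented.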

  was:
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but their capacity is small. HDDs have good
capacity, but are x2-x3 slower than SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup
storage.
When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it
gets blocks from SSDs first; when the SSDs' usable space is less than some
threshold, it gets blocks from HDDs.

In our implementation, we actually go further: we support building a hierarchy
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, e.g. when all data spills to HDDs, there is no
performance regression.
3. Compared with all HDDs, the hierarchy store improves performance by more
than *_x1.86_* (it could be higher; the CPU reaches a bottleneck in our test
environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because
we support both RDD cache and shuffle, with no extra inter-process
communication.

*Usage*
1. In spark-defaults.conf, configure spark.hierarchyStore.
{code}
-Dspark.hierarchyStore=nvm 50GB,ssd 80GB
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is
"ssd", and all the rest form the last layer.

2. Configure the "nvm" and "ssd" locations in the local dirs, e.g.
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
-Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After that, restart your Spark application; it will allocate blocks from nvm
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last
layer.



[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2015-12-09 Thread Pere Ferrera Bertran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048609#comment-15048609
 ] 

Pere Ferrera Bertran commented on SPARK-3461:
-

What's the status of this?
We provided a hack on top of the current API for Java users here, in case it's 
interesting for people hitting this issue: 
http://www.datasalt.com/2015/12/a-scalable-groupbykey-and-secondary-sort-for-java-spark/

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.
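
For readers who want to try the idea today, a minimal PySpark sketch of the
approach described above (group detection on top of
repartitionAndSortWithinPartitions); the function and variable names here are
illustrative, not part of any Spark API:
{code}
from itertools import groupby
from operator import itemgetter

def map_groups(pair_rdd, num_partitions, f):
    # Shuffle so that equal keys land in the same partition and arrive sorted
    # by key; equal keys are then adjacent within each partition.
    shuffled = pair_rdd.repartitionAndSortWithinPartitions(num_partitions)

    def run(partition):
        # A streaming groupby is enough to detect group boundaries without
        # ever materializing a whole group in memory.
        for key, pairs in groupby(partition, key=itemgetter(0)):
            # f gets a lazy iterator over the values; its result should be
            # small (it is what gets shipped back, e.g. an aggregate).
            yield f(key, (value for _, value in pairs))

    return shuffled.mapPartitions(run, preservesPartitioning=True)
{code}
Usage would be something like map_groups(rdd, 8, lambda k, vs: (k, sum(vs))):
each key's values are only streamed, so a single very large group never has to
fit in memory at once, which is what an externalized groupByKey is after.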






[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but their capacity is small. HDDs have good
capacity, but are x2-x3 slower than SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup
storage.
When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it
gets blocks from SSDs first; when the SSDs' usable space is less than some
threshold, it gets blocks from HDDs.

In our implementation, we actually go further: we support building a hierarchy
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, e.g. when all data spills to HDDs, there is no
performance regression.
3. Compared with all HDDs, the hierarchy store improves performance by more
than *_x1.86_* (it could be higher; the CPU reaches a bottleneck in our test
environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because
we support both RDD cache and shuffle, with no extra inter-process
communication.

*Usage*
1. In spark-defaults.conf, configure spark.hierarchyStore.
{code}
spark.hierarchyStore nvm 50GB, ssd 80GB
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is
"ssd", and all the rest form the last layer.

2. Configure the "nvm" and "ssd" locations in the local dirs, e.g.
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
-Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After that, restart your Spark application; it will allocate blocks from nvm
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last
layer.

  was:
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but their capacity is small. HDDs have good
capacity, but are x2-x3 slower than SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup
storage.
When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it
gets blocks from SSDs first; when the SSDs' usable space is less than some
threshold, it gets blocks from HDDs.

In our implementation, we actually go further: we support building a hierarchy
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, e.g. when all data spills to HDDs, there is no
performance regression.
3. Compared with all HDDs, the hierarchy store improves performance by more
than *_x1.86_* (it could be higher; the CPU reaches a bottleneck in our test
environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because
we support both RDD cache and shuffle, with no extra inter-process
communication.

*Usage*
1. In spark-defaults.conf, configure spark.hierarchyStore.
{code}
spark.hierarchyStore nvm 50GB, ssd 80GB
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is
"ssd", and all the rest form the last layer.

2. Configure the "nvm" and "ssd" locations in the local dirs, e.g.
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After that, restart your Spark application; it will allocate blocks from nvm
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last
layer.



[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but their capacity is small. HDDs have good
capacity, but are x2-x3 slower than SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup
storage.
When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it
gets blocks from SSDs first; when the SSDs' usable space is less than some
threshold, it gets blocks from HDDs.

In our implementation, we actually go further: we support building a hierarchy
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, e.g. when all data spills to HDDs, there is no
performance regression.
3. Compared with all HDDs, the hierarchy store improves performance by more
than *_x1.86_* (it could be higher; the CPU reaches a bottleneck in our test
environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because
we support both RDD cache and shuffle, with no extra inter-process
communication.

*Usage*
1. Configure spark.storage.hierarchyStore.
{code}
spark.hierarchyStore='nvm 50GB,ssd 80GB'
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is
"ssd", and all the rest form the last layer.

2. Configure the "nvm" and "ssd" locations in the local dirs, e.g.
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After that, restart your Spark application; it will allocate blocks from nvm
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last
layer.
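
A minimal sketch of setting the two properties above programmatically instead
of in a config file (PySpark; note that spark.hierarchyStore is the property
proposed by this JIRA, not an existing Spark setting, while spark.local.dir is
a standard one):
{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("hierarchy-store-demo")
        # Proposed property from this JIRA: 1st layer nvm (50GB threshold),
        # 2nd layer ssd (80GB threshold), remaining dirs form the last layer.
        .set("spark.hierarchyStore", "nvm 50GB,ssd 80GB")
        # Standard property: list the nvm and ssd mount points first.
        .set("spark.local.dir",
             "/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,"
             "/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others"))
sc = SparkContext(conf=conf)
{code}
On YARN the directory list comes from yarn.nodemanager.local-dirs instead, as
noted above.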

  was:
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but their capacity is small. HDDs have good
capacity, but are x2-x3 slower than SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup
storage.
When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it
gets blocks from SSDs first; when the SSDs' usable space is less than some
threshold, it gets blocks from HDDs.

In our implementation, we actually go further: we support building a hierarchy
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, e.g. when all data spills to HDDs, there is no
performance regression.
3. Compared with all HDDs, the hierarchy store improves performance by more
than *_x1.86_* (it could be higher; the CPU reaches a bottleneck in our test
environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because
we support both RDD cache and shuffle, with no extra inter-process
communication.

*Usage*
1. Configure spark.hierarchyStore.
{code}
spark.hierarchyStore='nvm 50GB,ssd 80GB'
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is
"ssd", and all the rest form the last layer.

2. Configure the "nvm" and "ssd" locations in the local dirs, e.g.
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After that, restart your Spark application; it will allocate blocks from nvm
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last
layer.



[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Comment: was deleted

(was: Sorry, I have to delete this PR because of a wrong GitHub operation on my
side; I will send a new one ASAP.)




[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Comment: was deleted

(was: Sorry, I have to delete this PR because of a wrong GitHub operation on my
side; I will send a new one ASAP.)




[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Comment: was deleted

(was: Sorry, I have to delete this PR because of a wrong GitHub operation on my
side; I will send a new one ASAP.)




[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Comment: was deleted

(was: Sorry, I have to delete this PR because of a wrong GitHub operation on my
side; I will send a new one ASAP.)




[jira] [Commented] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-09 Thread Irakli Machabeli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619
 ] 

Irakli Machabeli commented on SPARK-12218:
--

I'm afraid I don't really know what that means, "plan by explain(true)".
Shall I type it in the REPL?
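
For reference, "the plan by explain(true)" can be printed from the PySpark
REPL; a minimal sketch, assuming the same sqlContext and parquet path as in the
issue description below (explain(True) prints the parsed, analyzed, optimized,
and physical plans):
{code}
# Build the first query from the issue and print its extended plan; repeat
# with the rewritten (not A) or (not B) predicate to compare the two plans.
df = sqlContext.read.parquet('prp_enh1').where(
    "LoanID=62231 and not(PaymentsReceived=0 and "
    "ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))")
df.explain(True)
{code}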

[
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047928#comment-15047928
]

Xiao Li commented on SPARK-12218:
-

Could you provide the plan by explain(true)? [~imachabeli] Thanks!

"(not A) or (not B)"

PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff',
'PreviouslyChargedOff'))").count()
not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff',
'PreviouslyChargedOff')))").count()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


> Boolean logic in sql does not work  "not (A and B)" is not the same as  "(not 
> A) or (not B)"
> 
>
> Key: SPARK-12218
> URL: https://issues.apache.org/jira/browse/SPARK-12218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Irakli Machabeli
>Priority: Blocker
>
> Two logically equivalent queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( 
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28






[jira] [Commented] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-12-09 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048636#comment-15048636
 ] 

Adam Roberts commented on SPARK-9858:
-

Thanks for the prompt reply. rowBuffer is a variable in 
org.apache.spark.sql.execution.UnsafeRowSerializer, within the 
asKeyValueIterator method. I experimented with the Exchange class; the same 
problems are observed using SparkSqlSerializer, suggesting the 
UnsafeRowSerializer is probably fine.

I agree with your second comment; I think the code within 
org.apache.spark.unsafe.Platform is OK, or we'd be hitting problems elsewhere.

It would be useful to determine how the values in the assertions can be derived 
programmatically; I think the partitioning algorithm itself is working as 
expected, but for some reason stages require more bytes on the platforms I'm 
using.

spark.sql.shuffle.partitions is unchanged; I'm working off the latest master 
code.

Is there something special about the aggregate, join, and complex query 2 tests?

Can we print exactly what the bytes are for each stage? I know rdd.count is 
always correct and the DataFrames are the same (printed each row, written to 
json and parquet - no concerns).

Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate 
test passes.
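
For anyone reproducing that clue, a minimal sketch of pinning the post-shuffle
partition count from a REPL or test, assuming an existing SQLContext named
sqlContext (the key string is what SQLConf.SHUFFLE_PARTITIONS.key resolves to):
{code}
# Pin the shuffle partition count for the session, then re-run the failing
# test query; getConf just confirms the value took effect.
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
print(sqlContext.getConf("spark.sql.shuffle.partitions"))
{code}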

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048642#comment-15048642
 ] 

Apache Spark commented on SPARK-12196:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10225

> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs.
> SSDs have great performance, but their capacity is small. HDDs have good 
> capacity, but are x2-x3 slower than SSDs.
> How can we get the best of both?
> *Solution*
> Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
> storage.
> When Spark core allocates blocks for RDDs (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when the SSDs' usable space is less than some 
> threshold, it gets blocks from HDDs.
> In our implementation, we actually go further. We support a way to build a 
> hierarchy store with any number of levels across all storage media (NVM, SSD, 
> HDD, etc.).
> *Performance*
> 1. In the best case, our solution performs the same as all-SSDs.
> 2. In the worst case, e.g. when all data are spilled to HDDs, there is no 
> performance regression.
> 3. Compared with all-HDDs, the hierarchy store improves performance by more 
> than *_x1.86_* (it could be higher; CPU reaches a bottleneck in our test 
> environment).
> 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because 
> we support both RDD cache and shuffle and need no extra inter-process 
> communication.
> *Usage*
> 1. Configure spark.hierarchyStore.
> {code}
> spark.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and 
> all the rest form the last layer.
> 2. Configure the "nvm" and "ssd" locations in the local dirs, i.e. 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After that, restart your Spark application; it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.
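As an illustration of the allocation policy described above, a Spark-agnostic 
sketch (the tier names, thresholds, and the usable_space helper are illustrative 
only, not the actual patch):
{code}
import os

# Tiers mirroring the example config: nvm until <50GB free, ssd until <80GB free, then the rest.
TIERS = [("nvm", ["/mnt/nvm1"], 50 * 1024**3),
         ("ssd", ["/mnt/ssd1", "/mnt/ssd2", "/mnt/ssd3"], 80 * 1024**3),
         ("last", ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3", "/mnt/disk4"], 0)]

def usable_space(path):
    """Free bytes on the filesystem holding `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def pick_dir_for_block():
    """Walk tiers top-down and fall through once a tier drops below its threshold."""
    for name, dirs, threshold in TIERS:
        if sum(usable_space(d) for d in dirs) >= threshold:
            return dirs[0]   # a real implementation would spread blocks across dirs
    return TIERS[-1][1][0]   # the last layer is always the final fallback
{code}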



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12229) How to Perform spark submit of application written in scala from Node js

2015-12-09 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048663#comment-15048663
 ] 

Nan Zhu commented on SPARK-12229:
-

https://github.com/spark-jobserver/spark-jobserver might be a good reference

BTW, I think you can forward this question to the user list to get more feedback. 
Most of the discussions here are just for bug/feature tracking, etc. If you are 
comfortable with this, would you mind closing the issue?

> How to Perform spark submit of application written in scala from Node js
> 
>
> Key: SPARK-12229
> URL: https://issues.apache.org/jira/browse/SPARK-12229
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.3.1
> Environment: Linux 14.4
>Reporter: himanshu singhal
>  Labels: newbie
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> I have a Spark core application written in Scala, and now I am developing the 
> front end in Node.js. My question is: how can I make spark-submit run the 
> application from Node.js?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12240) FileNotFoundException: (Too many open files) when using multiple groupby on DataFrames

2015-12-09 Thread Shubhanshu Mishra (JIRA)
Shubhanshu Mishra created SPARK-12240:
-

 Summary: FileNotFoundException: (Too many open files) when using 
multiple groupby on DataFrames
 Key: SPARK-12240
 URL: https://issues.apache.org/jira/browse/SPARK-12240
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.5.0
 Environment: Debian 3.2.68-1+deb7u6 x86_64 GNU/Linux
Reporter: Shubhanshu Mishra


Whenever I try to do multiple groupings using DataFrames, my job crashes with a 
FileNotFoundException whose message is "too many open files".

I can do these groupings easily using RDDs, but when I use the DataFrame 
operations I see these issues.
The code I am running:
```
df_t = df.filter(df['max_cum_rank'] == 
0).select(['col1','col2']).groupby('col1').agg(F.min('col2')).groupby('min(col2)').agg(F.countDistinct('col1')).toPandas()
```
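For comparison, a hypothetical RDD-based equivalent of the pipeline above 
(column names come from the snippet; the exact aggregation mapping is my reading 
of it):
{code}
# min(col2) per col1, then the number of distinct col1 values per min(col2).
pairs = (df.filter(df['max_cum_rank'] == 0)
           .select(['col1', 'col2'])
           .rdd.map(lambda r: (r['col1'], r['col2'])))

min_col2 = pairs.reduceByKey(min)                    # (col1, min(col2))
counts = (min_col2.map(lambda kv: (kv[1], kv[0]))    # (min(col2), col1)
                  .distinct()
                  .countByKey())                     # {min(col2): distinct col1 count}
{code}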



In [151]: df_t = df.filter(df['max_cum_rank'] == 
0).select(['col1','col2']).groupby('col1').agg(F.min('col2')).groupby('min(col2)').agg(F.countDistinct('col1')).toPandas()
[Stage 27:=>(415 + 1) / 
416]15/12/09 06:36:36 ERROR DiskBlockObjectWriter: Uncaught exception while 
reverting partial writes to file 
/path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/22/temp_shuffle_1abbf917-842c-41ef-b113-ed60ee22e675
java.io.FileNotFoundException: 
/path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/22/temp_shuffle_1abbf917-842c-41ef-b113-ed60ee22e675
 (Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at 
org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:160)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:174)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:104)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/12/09 06:36:36 ERROR DiskBlockObjectWriter: Uncaught exception while 
reverting partial writes to file 
/path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/29/temp_shuffle_e35e6e28-fdbf-4775-a32d-d0f5fd882e9e
java.io.FileNotFoundException: 
/path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/29/temp_shuffle_e35e6e28-fdbf-4775-a32d-d0f5fd882e9e
 (Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at 
org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:160)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:174)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:104)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/12/09 06:36:36 ERROR DiskBlockObjectWriter: Uncaught exception while 
reverting partial writes to file 
/path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/18/temp_shuffle_2d26adcb-e3bb-4a01-8998-7428ebe5544d
java.io.FileNotFoundException: 
/path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/18/temp_shuffle_2d26adcb-e3bb-4a01-8998-7428ebe5544d
 (Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at 
org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:160)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:174)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:104)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at 

[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Comment: was deleted

(was: Sorry, I have to delete this PR because of my wrong github operation, I 
will send a new one ASAP.)

> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs.
> SSDs have great performance, but their capacity is small. HDDs have good 
> capacity, but are x2-x3 slower than SSDs.
> How can we get the best of both?
> *Solution*
> Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
> storage.
> When Spark core allocates blocks for RDDs (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when the SSDs' usable space is less than some 
> threshold, it gets blocks from HDDs.
> In our implementation, we actually go further. We support a way to build a 
> hierarchy store with any number of levels across all storage media (NVM, SSD, 
> HDD, etc.).
> *Performance*
> 1. In the best case, our solution performs the same as all-SSDs.
> 2. In the worst case, e.g. when all data are spilled to HDDs, there is no 
> performance regression.
> 3. Compared with all-HDDs, the hierarchy store improves performance by more 
> than *_x1.86_* (it could be higher; CPU reaches a bottleneck in our test 
> environment).
> 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because 
> we support both RDD cache and shuffle and need no extra inter-process 
> communication.
> *Usage*
> 1. In spark-default.xml, configure spark.hierarchyStore.
> {code}
> spark.hierarchyStore nvm 50GB, ssd 80GB
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and 
> all the rest form the last layer.
> 2. Configure the "nvm" and "ssd" locations in the local dirs, i.e. 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After that, restart your Spark application; it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-09 Thread Irakli Machabeli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619
 ] 

Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:28 PM:
---

Below is the explain plan.
To make it clear, the query that contains not (A and B):
{code}
and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))
{code}
produces wrong results, while the query that is already expanded as (not A) or 
(not B) produces correct output.
By the way, I saw cast(0 as double) in the explain plan, so I tried changing 
0 => 0.0, but it made no difference.


The physical plans look similar:

{code}
'Filter (('LoanID = 62231) && (NOT ('PaymentsReceived = 0) || NOT 'ExplicitRoll 
IN (PreviouslyPaidOff,PreviouslyChargedOff)))
Filter ((LoanID#8588 = 62231) && (NOT (PaymentsReceived#8601 = 0.0) || NOT 
ExplicitRoll#8611 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
{code}

Explain plan results: 

{code}
In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))").explain(True)
{code}

{noformat}
== Parsed Logical Plan ==
'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN 
(PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Analyzed Logical Plan ==
BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: 
int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: 
string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: 
string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, 
ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, 
PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, 
InterestPaid: double, LateFees: double, ServicingFees: double, 
RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, 
PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: 
double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: 
double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, 
DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, 
NetCashToInvestorsFromDebtSale: double, OZVintage: string
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as 
double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Optimized Logical Plan ==
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && 
ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 

[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-09 Thread Irakli Machabeli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619
 ] 

Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:40 PM:
---

Below is the explain plan.
To make it clear, the query that contains not (A and B):
{code}
and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))
{code}
produces wrong results, while the query that is already expanded as (not A) or 
(not B) produces correct output.



The physical plans look like this:

{code}
wrong results-- Filter ((LoanID#8629 = 62231) && NOT ((PaymentsReceived#8642 = 
0.0) && ExplicitRoll#8652 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
correct results -- Filter ((LoanID#8803 = 62231) && (NOT (PaymentsReceived#8816 
= 0.0) || NOT ExplicitRoll#8826 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
{code}

Explain plan results: 

{code}
Wrong:
In [15]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
PaymentsReceived=0.0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))").explain(True)
{code}

{noformat}
== Parsed Logical Plan ==
'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0.0) && 'ExplicitRoll 
IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8625,MnthRate#8626,ObservationMonth#8627,CycleCounter#8628,LoanID#8629,Loankey#8630,OriginationDate#8631,OriginationQuarter#8632,LoanAmount#8633,Term#8634,LenderRate#8635,ProsperRating#8636,ScheduledMonthlyPaymentAmount#8637,ChargeoffMonth#8638,ChargeoffAmount#8639,CompletedMonth#8640,MonthOfLastPayment#8641,PaymentsReceived#8642,CollectionFees#8643,PrincipalPaid#8644,InterestPaid#8645,LateFees#8646,ServicingFees#8647,RecoveryPayments#8648,RecoveryPrin#8649,DaysPastDue#8650,PriorMonthDPD#8651,ExplicitRoll#8652,SummaryRoll#8653,CumulPrin#8654,EOMPrin#8655,ScheduledPrinRemaining#8656,ScheduledCumulPrin#8657,ScheduledPeriodicPrin#8658,BOMPrin#8659,ListingNumber#8660,DebtSaleMonth#8661,GrossCashFromDebtSale#8662,DebtSaleFee#8663,NetCashToInvestorsFromDebtSale#8664,OZVintage#8665]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Analyzed Logical Plan ==
BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: 
int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: 
string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: 
string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, 
ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, 
PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, 
InterestPaid: double, LateFees: double, ServicingFees: double, 
RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, 
PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: 
double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: 
double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, 
DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, 
NetCashToInvestorsFromDebtSale: double, OZVintage: string
Filter ((LoanID#8629 = 62231) && NOT ((PaymentsReceived#8642 = cast(0.0 as 
double)) && ExplicitRoll#8652 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8625,MnthRate#8626,ObservationMonth#8627,CycleCounter#8628,LoanID#8629,Loankey#8630,OriginationDate#8631,OriginationQuarter#8632,LoanAmount#8633,Term#8634,LenderRate#8635,ProsperRating#8636,ScheduledMonthlyPaymentAmount#8637,ChargeoffMonth#8638,ChargeoffAmount#8639,CompletedMonth#8640,MonthOfLastPayment#8641,PaymentsReceived#8642,CollectionFees#8643,PrincipalPaid#8644,InterestPaid#8645,LateFees#8646,ServicingFees#8647,RecoveryPayments#8648,RecoveryPrin#8649,DaysPastDue#8650,PriorMonthDPD#8651,ExplicitRoll#8652,SummaryRoll#8653,CumulPrin#8654,EOMPrin#8655,ScheduledPrinRemaining#8656,ScheduledCumulPrin#8657,ScheduledPeriodicPrin#8658,BOMPrin#8659,ListingNumber#8660,DebtSaleMonth#8661,GrossCashFromDebtSale#8662,DebtSaleFee#8663,NetCashToInvestorsFromDebtSale#8664,OZVintage#8665]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Optimized Logical Plan ==
Filter ((LoanID#8629 = 62231) && NOT ((PaymentsReceived#8642 = 0.0) && 
ExplicitRoll#8652 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 

[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2015-12-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048802#comment-15048802
 ] 

Reynold Xin commented on SPARK-3461:


I'm going to close this one since this API is now available in the new Dataset 
API (SPARK-).


> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.
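For illustration, a plain-Python sketch (not the Spark internals) of the 
"lookahead iterator that detects group boundaries" idea over a key-sorted record 
stream:
{code}
from itertools import groupby
from operator import itemgetter

def external_group_by_key(sorted_kv_iter):
    """Given (key, value) records sorted by key within a partition, yield
    (key, values-iterator) pairs by detecting key boundaries lazily instead of
    buffering a whole group in memory."""
    for key, group in groupby(sorted_kv_iter, key=itemgetter(0)):
        yield key, (v for _, v in group)

# Toy usage: records already sorted by key, as sort-based shuffle would provide.
records = [("a", 1), ("a", 2), ("b", 3), ("c", 4)]
for k, vals in external_group_by_key(records):
    print(k, list(vals))  # each group must be consumed before advancing to the next
{code}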



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12217:
--
Target Version/s:   (was: 1.6.0)
   Fix Version/s: (was: 1.6.1)
  (was: 2.0.0)

[~BenFradet] please don't set fix or target version

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12219) Spark 1.5.2 code does not build on Scala 2.11.7 with SBT assembly

2015-12-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048807#comment-15048807
 ] 

Sean Owen commented on SPARK-12219:
---

Can you provide more detail? What exactly is the error, and how are you 
building? It compiles against 2.11 at the moment, or did so recently.

> Spark 1.5.2 code does not build on Scala 2.11.7 with SBT assembly
> -
>
> Key: SPARK-12219
> URL: https://issues.apache.org/jira/browse/SPARK-12219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
>Reporter: Rodrigo Boavida
>
> I've tried, with no success, to build Spark on Scala 2.11.7. I'm getting build 
> errors using sbt due to the issues found in the thread below, from July of this 
> year.
> https://mail-archives.apache.org/mod_mbox/spark-dev/201507.mbox/%3CCA+3qhFSJGmZToGmBU1=ivy7kr6eb7k8t6dpz+ibkstihryw...@mail.gmail.com%3E
> It seems some minor fixes are needed to make the Scala 2.11 compiler happy.
> I needed to build with SBT, as suggested in the thread below, to get around an 
> apparent Maven shade-plugin issue which changed some classes when I switched 
> to Akka 2.4.0.
> https://groups.google.com/forum/#!topic/akka-user/iai6whR6-xU
> I've set this bug to Major priority, assuming that the Spark community wants 
> to keep fully supporting SBT builds, including Scala 2.11 compatibility.
> Thanks,
> Rod



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12207) "org.datanucleus" is already registered

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12207.
---
Resolution: Invalid

I don't think this is appropriate as a JIRA; you're not narrowing down the problem 
to an actionable issue. It looks like a problem in your environment.

> "org.datanucleus" is already registered
> ---
>
> Key: SPARK-12207
> URL: https://issues.apache.org/jira/browse/SPARK-12207
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: windows 7 64 bit
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> winutils ls \tmp\hive ->
> drwxrwxrwx 1 PC\Stefan BloomBear-SSD\None 0 Dec  8 2015 \tmp\hive
>Reporter: stefan
>
> I read the response to an identical issue:
> https://issues.apache.org/jira/browse/SPARK-11142
> and then searched the mailing list archives here:
> http://apache-spark-user-list.1001560.n3.nabble.com/
> but got no help
> http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page=1=%22org.datanucleus%22+is+already+registered=0
> There were only 6 posts across all the dates, but no resolution.
> My apache spark folder has 3 jar files in the lib folder
> datanucleus-api-jdo-3.2.6.jar
> datanucleus-core-3.2.10.jar
> datanucleus-rdbms-3.2.9.jar
> and these are compiled -> mostly unreadable
> There are no other files with the word 'datanucleus' on my pc.
> Maven's pom.xml file has been implicated
> http://stackoverflow.com/questions/877949/conflicting-versions-of-datanucleus-enhancer-in-a-maven-google-app-engine-projec
> but I do not have maven on this windows box.
> This post:
> http://www.worldofdb2.com/profiles/blogs/using-spark-s-interactive-scala-shell-for-accessing-db2-data
> suggests downloading 
> DB2 JDBC driver jar (db2jcc.jar or db2jcc4.jar) and setting 
> SPARK_CLASSPATH=c:\db2jcc.jar
> but the mailing list archives say SPARK_CLASSPATH is deprecated
> https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3C01a901d0547c$a23ba480$e6b2ed80$@innowireless.com%3E
> Could I have a pointer to a resolution of this problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12031) Integer overflow when do sampling.

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12031.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10023
[https://github.com/apache/spark/pull/10023]

> Integer overflow when do sampling.
> --
>
> Key: SPARK-12031
> URL: https://issues.apache.org/jira/browse/SPARK-12031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2
>Reporter: uncleGen
>Priority: Critical
> Fix For: 2.0.0, 1.6.1
>
>
> In my case, some partitions contain too many items. When doing range 
> partitioning, an exception is thrown:
> {code}
> java.lang.IllegalArgumentException: n must be positive
> at java.util.Random.nextInt(Random.java:300)
> at 
> org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
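For illustration, a plain-Python sketch of the failure mode (emulating 32-bit 
wrap-around by hand; this is not the actual Scala code): once the item count no 
longer fits in a signed 32-bit int, the bound passed to nextInt becomes negative 
and the "n must be positive" error is raised.
{code}
import random

def next_int(bound):
    """Mimic java.util.Random.nextInt(n): n must be strictly positive."""
    if bound <= 0:
        raise ValueError("n must be positive")
    return random.randrange(bound)

items_seen = 3000000000              # more items than a signed 32-bit int can hold
as_int32 = items_seen & 0xFFFFFFFF   # emulate truncation to 32 bits
if as_int32 >= 2**31:
    as_int32 -= 2**32
print(as_int32)                      # -1294967296: the wrapped, negative bound
next_int(as_int32)                   # raises ValueError, like the reported exception
{code}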



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12195) Adding BigDecimal, Date and Timestamp into Encoder

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12195:
--
Assignee: Xiao Li

> Adding BigDecimal, Date and Timestamp into Encoder
> --
>
> Key: SPARK-12195
> URL: https://issues.apache.org/jira/browse/SPARK-12195
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 1.6.0
>
>
> Add three types, DecimalType, DateType and TimestampType, to Encoder for the 
> Dataset APIs.
> DecimalType -> java.math.BigDecimal
> DateType -> java.sql.Date
> TimestampType -> java.sql.Timestamp



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12201) add type coercion rule for greatest/least

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12201:
--
Assignee: Wenchen Fan

> add type coercion rule for greatest/least
> -
>
> Key: SPARK-12201
> URL: https://issues.apache.org/jira/browse/SPARK-12201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-09 Thread Irakli Machabeli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619
 ] 

Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:27 PM:
---

Below is the explain plan.
To make it clear, the query that contains not (A and B):
{code}
and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))")
{code}
produces wrong results, while the query that is already expanded as (not A) or 
(not B) produces correct output.
By the way, I saw cast(0 as double) in the explain plan, so I tried changing 
0 => 0.0, but it made no difference.


The physical plans look similar:

{code}
'Filter (('LoanID = 62231) && (NOT ('PaymentsReceived = 0) || NOT 'ExplicitRoll 
IN (PreviouslyPaidOff,PreviouslyChargedOff)))
Filter ((LoanID#8588 = 62231) && (NOT (PaymentsReceived#8601 = 0.0) || NOT 
ExplicitRoll#8611 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
{code}

Explain plan results: 

{code}
In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))").explain(True)
{code}

{noformat}
== Parsed Logical Plan ==
'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN 
(PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Analyzed Logical Plan ==
BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: 
int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: 
string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: 
string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, 
ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, 
PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, 
InterestPaid: double, LateFees: double, ServicingFees: double, 
RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, 
PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: 
double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: 
double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, 
DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, 
NetCashToInvestorsFromDebtSale: double, OZVintage: string
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as 
double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Optimized Logical Plan ==
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && 
ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 

[jira] [Created] (SPARK-12241) Improve failure reporting in Yarn client obtainTokenForHBase()

2015-12-09 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-12241:
--

 Summary: Improve failure reporting in Yarn client 
obtainTokenForHBase()
 Key: SPARK-12241
 URL: https://issues.apache.org/jira/browse/SPARK-12241
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.5.2
 Environment: Kerberos
Reporter: Steve Loughran
Priority: Minor


SPARK-11265 improved reporting of reflection problems getting the Hive token in 
a secure cluster.

The `obtainTokenForHBase()` method needs to pick up the same policy of what 
exceptions to swallow/reject, and report any swallowed exceptions with the full 
stack trace.

Without this, problems authenticating with HBase result in log errors with no 
meaningful information at all.
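For illustration, a framework-agnostic sketch of the reporting policy described 
above (the exception classes, logger name, and helper are placeholders, not the 
actual YARN client code):
{code}
import logging

log = logging.getLogger("yarn.tokens")  # placeholder logger name

def obtain_token_safely(obtain):
    """Swallow only 'expected' failures, but always log them with the full
    stack trace so the real cause of the authentication problem is visible."""
    expected = (ImportError, AttributeError)  # stand-ins for reflection failures
    try:
        return obtain()
    except expected:
        log.warning("Could not obtain HBase token; continuing without it",
                    exc_info=True)            # full stack trace, not just a one-liner
        return None
    # Anything unexpected propagates instead of being silently swallowed.
{code}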



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12242) DataFrame.transform function

2015-12-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12242:
---

 Summary: DataFrame.transform function
 Key: SPARK-12242
 URL: https://issues.apache.org/jira/browse/SPARK-12242
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Similar to Dataset.transform.
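For context, a sketch of the transform-style chaining pattern this refers to, 
written as plain PySpark functions (an illustration of the idea only, not the 
proposed API):
{code}
from pyspark.sql import functions as F

# A "transform" is just a function DataFrame -> DataFrame, so pipeline steps
# compose by function application instead of deeply nested calls.
def with_doubled(df):
    return df.withColumn("doubled", F.col("value") * 2)

def only_positive(df):
    return df.filter(F.col("value") > 0)

df = sqlContext.createDataFrame([(1,), (-2,), (3,)], ["value"])
result = only_positive(with_doubled(df))  # what a transform(...) helper would streamline
result.show()
{code}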




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12242) DataFrame.transform function

2015-12-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048818#comment-15048818
 ] 

Apache Spark commented on SPARK-12242:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10226

> DataFrame.transform function
> 
>
> Key: SPARK-12242
> URL: https://issues.apache.org/jira/browse/SPARK-12242
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Similar to Dataset.transform.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12242) DataFrame.transform function

2015-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12242:


Assignee: Reynold Xin  (was: Apache Spark)

> DataFrame.transform function
> 
>
> Key: SPARK-12242
> URL: https://issues.apache.org/jira/browse/SPARK-12242
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Similar to Dataset.transform.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-09 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but their capacity is small. HDDs have good 
capacity, but are x2-x3 slower than SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage.
When Spark core allocates blocks for RDDs (either shuffle or RDD cache), it gets 
blocks from SSDs first, and when the SSDs' usable space is less than some 
threshold, it gets blocks from HDDs.

In our implementation, we actually go further. We support a way to build a 
hierarchy store with any number of levels across all storage media (NVM, SSD, 
HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all-SSDs.
2. In the worst case, e.g. when all data are spilled to HDDs, there is no 
performance regression.
3. Compared with all-HDDs, the hierarchy store improves performance by more than 
*_x1.86_* (it could be higher; CPU reaches a bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because 
we support both RDD cache and shuffle and need no extra inter-process 
communication.

*Usage*
1. Configure spark.storage.hierarchyStore.
{code}
spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
{code}
It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and all 
the rest form the last layer.

2. Configure the "nvm" and "ssd" locations in the local dirs, i.e. spark.local.dir 
or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After that, restart your Spark application; it will allocate blocks from nvm 
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.

  was:
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but their capacity is small. HDDs have good 
capacity, but are x2-x3 slower than SSDs.
How can we get the best of both?

*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
storage.
When Spark core allocates blocks for RDDs (either shuffle or RDD cache), it gets 
blocks from SSDs first, and when the SSDs' usable space is less than some 
threshold, it gets blocks from HDDs.

In our implementation, we actually go further. We support a way to build a 
hierarchy store with any number of levels across all storage media (NVM, SSD, 
HDD, etc.).

*Performance*
1. In the best case, our solution performs the same as all-SSDs.
2. In the worst case, e.g. when all data are spilled to HDDs, there is no 
performance regression.
3. Compared with all-HDDs, the hierarchy store improves performance by more than 
*_x1.86_* (it could be higher; CPU reaches a bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because 
we support both RDD cache and shuffle and need no extra inter-process 
communication.

*Usage*
1. Configure spark.storage.hierarchyStore.
{code}
spark.hierarchyStore='nvm 50GB,ssd 80GB'
{code}
It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and all 
the rest form the last layer.

2. Configure the "nvm" and "ssd" locations in the local dirs, i.e. spark.local.dir 
or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After that, restart your Spark application; it will allocate blocks from nvm 
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last 
layer.


> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs.
> SSDs have great performance, but their capacity is small. HDDs have good 
> capacity, but are x2-x3 slower than SSDs.
> How can we get the best of both?
> *Solution*
> Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
> storage.
> When Spark core allocates blocks for RDDs (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when the SSDs' usable space is less than some 
> threshold, it gets blocks from HDDs.
> In our implementation, we actually go further. We support a way to build a 
> hierarchy store with any number of levels across all storage media (NVM, SSD, 
> HDD, etc.).
> *Performance*
> 1. In the best case, our solution performs the same as all-SSDs.
> 2. In the worst case, e.g. when all data are spilled to HDDs, there is no 
> performance regression.
> 3. Compared with all-HDDs, the hierarchy 

[jira] [Resolved] (SPARK-12229) How to Perform spark submit of application written in scala from Node js

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12229.
---
  Resolution: Invalid
Target Version/s:   (was: 1.3.1)

[~himanshu.techsk...@gmail.com] as I say, the user@ list is the right place for 
questions. Don't open a JIRA.

> How to Perform spark submit of application written in scala from Node js
> 
>
> Key: SPARK-12229
> URL: https://issues.apache.org/jira/browse/SPARK-12229
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.3.1
> Environment: Linux 14.4
>Reporter: himanshu singhal
>  Labels: newbie
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> I have a Spark core application written in Scala, and now I am developing the 
> front end in Node.js. My question is: how can I make spark-submit run the 
> application from Node.js?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12216) Spark failed to delete temp directory

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12216:
--
   Priority: Minor  (was: Major)
Component/s: Spark Shell

[~skypickle] have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a JIRA. This can't be Major and needs a component.

It's not clear why: a permissions problem? Something else holding that dir open? Is 
it reproducible? I suspect it's a problem specific to Windows. Do you have a 
suggestion about how to work around it?

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6929) Alias for more complex expression causes attribute not been able to resolve

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6929.
--
Resolution: Duplicate

> Alias for more complex expression causes attribute not been able to resolve
> ---
>
> Key: SPARK-6929
> URL: https://issues.apache.org/jira/browse/SPARK-6929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michał Warecki
>Priority: Critical
>
> I've extracted the minimal query that doesn't work with aliases. You can remove 
> the tstudent expression ((tstudent((COUNT(g_0.test2_value) - 1)) from that query 
> and the result will be the same. In the exception you can see that c_0 is not 
> resolved, but c_1 causes that problem.
> {code}
> SELECT g_0.test1 AS c_0, (AVG(g_0.test2) - ((tstudent((COUNT(g_0.test2_value) 
> - 1)) * stddev(g_0.test2_value)) / sqrt(convert(COUNT(g_0.test2), long AS 
> c_1 FROM sometable AS g_0 GROUP BY g_0.test1 ORDER BY c_0 LIMIT 502
> {code}
> cause exception:
> {code}
> Remote org.apache.spark.sql.AnalysisException: cannot resolve 'c_0' given 
> input columns c_0, c_1; line 1 pos 246
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at 

[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-09 Thread Irakli Machabeli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619
 ] 

Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 1:47 PM:
---

In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))").explain(True)
== Parsed Logical Plan ==
'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN 
(PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Analyzed Logical Plan ==
BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: 
int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: 
string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: 
string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, 
ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, 
PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, 
InterestPaid: double, LateFees: double, ServicingFees: double, 
RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, 
PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: 
double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: 
double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, 
DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, 
NetCashToInvestorsFromDebtSale: double, OZVintage: string
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as 
double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Optimized Logical Plan ==
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && 
ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Physical Plan ==
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && 
ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 Scan 

[jira] [Commented] (SPARK-2280) Java & Scala reference docs should describe function reference behavior.

2015-12-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048750#comment-15048750
 ] 

Sean Owen commented on SPARK-2280:
--

I think it's still a reasonable improvement to the docs. I'd restrict it to the 
RDD class and related core RDD classes like JavaRDDLike and PairRDDFunctions. 
If interested, maybe start with a "[WIP]" (work in progress) PR to see if 
people support adding all the tags and to let them review the descriptions, 
then finish the job if it's OK.

> Java & Scala reference docs should describe function reference behavior.
> 
>
> Key: SPARK-2280
> URL: https://issues.apache.org/jira/browse/SPARK-2280
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.0.0
>Reporter: Hans Uhlig
>Priority: Minor
>  Labels: starter
>
> Example
>  JavaPairRDD<K, Iterable<T>> groupBy(Function<T, K> f)
> Return an RDD of grouped elements. Each group consists of a key and a 
> sequence of elements mapping to that key. 
> T and K are not described, and there is no explanation of what the function's 
> inputs and outputs should be or how groupBy uses this information.
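For illustration, a minimal sketch (assumed example, not proposed doc text) of what T and K mean here, using the Scala RDD API:

{code}
// T is the element type of the source RDD; K is the key the function computes.
// groupBy pairs each key with every element that mapped to it.
import org.apache.spark.{SparkConf, SparkContext}

object GroupByDocExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("groupBy-doc").setMaster("local[2]"))
    val words = sc.parallelize(Seq("ant", "bee", "cat", "cow"))   // T = String
    val byFirstLetter = words.groupBy(w => w.charAt(0))           // K = Char
    byFirstLetter.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }
    sc.stop()
  }
}
{code}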



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12208) Abstract the examples into a common place

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12208:
--
Target Version/s:   (was: 1.6.0)
Priority: Minor  (was: Major)

> Abstract the examples into a common place
> -
>
> Key: SPARK-12208
> URL: https://issues.apache.org/jira/browse/SPARK-12208
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>Priority: Minor
>
> When we write examples in the code, we put the generation of the data along 
> with the example itself. We typically have either:
> {code}
> val data = 
> sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
> ...
> {code}
> or some more esoteric stuff such as:
> {code}
> val data = Array(
>   (0, 0.1),
>   (1, 0.8),
>   (2, 0.2)
> )
> val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", 
> "feature")
> {code}
> {code}
> val data = Array(
>   Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
>   Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
>   Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
> )
> val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
> {code}
> I suggest we follow the example of sklearn and standardize all the generation 
> of example data inside a few methods, for example in 
> {{org.apache.spark.ml.examples.ExampleData}}. One reason is that just reading 
> the code is sometimes not enough to figure out what the data is supposed to 
> be. For example, when using {{libsvm_data}}, it is unclear what the DataFrame 
> columns are. This is something we should comment somewhere.
> Also, it would help to explain in one place all the Scala idiosyncrasies, such 
> as using {{Tuple1.apply}}.
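A hedged sketch of what such a shared helper could look like (the object name comes from the suggestion above; the method names and comments are illustrative, not an existing API):

{code}
package org.apache.spark.ml.examples

import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical central home for example data, so each example can document
// its columns once instead of repeating ad-hoc generation code.
object ExampleData {

  /** A tiny labeled dataset: columns are "label" (Int) and "feature" (Double). */
  def labeledPoints(sqlContext: SQLContext): DataFrame = {
    sqlContext.createDataFrame(Seq(
      (0, 0.1),
      (1, 0.8),
      (2, 0.2)
    )).toDF("label", "feature")
  }

  /** The standard libsvm sample: columns are "label" and "features" (a Vector). */
  def libsvmSample(sqlContext: SQLContext): DataFrame =
    sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
}
{code}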



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12031) Integer overflow when do sampling.

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12031:
--
Assignee: uncleGen
Priority: Major  (was: Critical)

> Integer overflow when do sampling.
> --
>
> Key: SPARK-12031
> URL: https://issues.apache.org/jira/browse/SPARK-12031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2
>Reporter: uncleGen
>Assignee: uncleGen
> Fix For: 1.6.1, 2.0.0
>
>
> In my case, some partitions contain too many items. When doing a range 
> partition, an exception is thrown:
> {code}
> java.lang.IllegalArgumentException: n must be positive
> at java.util.Random.nextInt(Random.java:300)
> at 
> org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
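For context, a minimal illustration (plain Scala, not Spark code; the item count below is just an assumed example) of why an Int overflow trips Random.nextInt with "n must be positive":

{code}
import java.util.Random

object SamplingOverflowDemo {
  def main(args: Array[String]): Unit = {
    val itemsSeen: Long = 3000000000L   // more items than Int.MaxValue
    val asInt: Int = itemsSeen.toInt    // wraps around to a negative number
    println(asInt)                      // -1294967296
    new Random().nextInt(asInt)         // throws IllegalArgumentException: n must be positive
  }
}
{code}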



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2015-12-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3461.

   Resolution: Implemented
 Assignee: Reynold Xin  (was: Sandy Ryza)
Fix Version/s: 1.6.0

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 1.6.0
>
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.
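A hedged sketch of the lookahead idea described above, written against the public RDD API for a concrete (String, Int) pair RDD; this is not Spark's implementation, and a truly external version would also stream each group's values instead of buffering them:

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ExternalGroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("external-groupBy").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

    val grouped = pairs
      // The sort-based shuffle places all values for a key next to each other.
      .repartitionAndSortWithinPartitions(new HashPartitioner(4))
      .mapPartitions { iter =>
        val buffered = iter.buffered
        // Lookahead iterator: each call to next() consumes one run of equal keys.
        new Iterator[(String, Seq[Int])] {
          def hasNext: Boolean = buffered.hasNext
          def next(): (String, Seq[Int]) = {
            val key = buffered.head._1
            val values = scala.collection.mutable.ArrayBuffer[Int]()
            while (buffered.hasNext && buffered.head._1 == key) {
              values += buffered.next()._2
            }
            (key, values)
          }
        }
      }

    grouped.collect().foreach(println)
    sc.stop()
  }
}
{code}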



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-09 Thread Irakli Machabeli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619
 ] 

Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:22 PM:
---

Below is the explain plan.
To make it clear, the query that contains 
{code}
"and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))")"  
{code}  
produces wrong results, while the one that is already expanded as (not A) or 
(not B) produces correct output.


Physical plan looks similar:

{code}
'Filter (('LoanID = 62231) && (NOT ('PaymentsReceived = 0) || NOT 'ExplicitRoll 
IN (PreviouslyPaidOff,PreviouslyChargedOff)))
Filter ((LoanID#8588 = 62231) && (NOT (PaymentsReceived#8601 = 0.0) || NOT 
ExplicitRoll#8611 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
{code}

Explain plan results: 

{code}
In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))").explain(True)
{code}

{noformat}
== Parsed Logical Plan ==
'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN 
(PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Analyzed Logical Plan ==
BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: 
int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: 
string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: 
string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, 
ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, 
PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, 
InterestPaid: double, LateFees: double, ServicingFees: double, 
RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, 
PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: 
double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: 
double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, 
DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, 
NetCashToInvestorsFromDebtSale: double, OZVintage: string
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as 
double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Optimized Logical Plan ==
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && 
ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 

[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-09 Thread Irakli Machabeli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619
 ] 

Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:25 PM:
---

Below is the explain plan.
To make it clear, the query that contains not (A and B):
{code}
"and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))")"  
{code}  
produces wrong results, 
while the query that is already expanded as (not A) or (not B) produces correct 
output.
By the way, I saw cast(0 as double) in the explain plan, so I tried changing 
0 => 0.0, but it made no difference.


Physical plan looks similar:

{code}
'Filter (('LoanID = 62231) && (NOT ('PaymentsReceived = 0) || NOT 'ExplicitRoll 
IN (PreviouslyPaidOff,PreviouslyChargedOff)))
Filter ((LoanID#8588 = 62231) && (NOT (PaymentsReceived#8601 = 0.0) || NOT 
ExplicitRoll#8611 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
{code}
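For reference, a hedged reconstruction of the expanded query that produces the plan above (shown with the Scala DataFrame API; the reporter ran the PySpark equivalent):

{code}
sqlContext.read.parquet("prp_enh1")
  .where("LoanID = 62231 and (not (PaymentsReceived = 0)" +
    " or ExplicitRoll not in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))")
  .explain(true)
{code}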

Explain plan results: 

{code}
In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
'PreviouslyChargedOff'))").explain(True)
{code}

{noformat}
== Parsed Logical Plan ==
'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN 
(PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Analyzed Logical Plan ==
BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: 
int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: 
string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: 
string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, 
ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, 
PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, 
InterestPaid: double, LateFees: double, ServicingFees: double, 
RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, 
PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: 
double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: 
double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, 
DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, 
NetCashToInvestorsFromDebtSale: double, OZVintage: string
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as 
double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583]
 ParquetRelation[file:/d:/MktLending/prp_enh1]

== Optimized Logical Plan ==
Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && 
ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
 

[jira] [Comment Edited] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-12-09 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048636#comment-15048636
 ] 

Adam Roberts edited comment on SPARK-9858 at 12/9/15 2:35 PM:
--

Thanks for the prompt reply. rowBuffer is a variable in 
org.apache.spark.sql.execution.UnsafeRowSerializer within the 
asKeyValueIterator method. I experimented with the Exchange class; the same 
problems are observed using SparkSqlSerializer, suggesting the 
UnsafeRowSerializer is probably fine.

I agree with your second comment, I think the code within 
org.apache.spark.unsafe.Platform is OK or we'd be hitting problems elsewhere.

It'll be useful to determine how the values in the assertions can be derived 
programmatically. I think the partitioning algorithm itself is working as 
expected, but for some reason stages require more bytes on the platforms I'm 
using.

spark.sql.shuffle.partitions is unchanged, I'm working off the latest master 
code.

Is there something special about the aggregate, join, and complex query 2 tests?

Can we print exactly what the bytes are for each stage? I know rdd.count is 
always correct and the DataFrames are the same (printed each row, written to 
json and parquet - no concerns).

Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate 
test passes.

I'm wondering if there's an extra factor we should be taking into account


was (Author: aroberts):
Thanks for the prompt reply, rowBuffer is a variable in 
org.apache.spark.sql.execution.UnsafeRowSerializer within the 
asKeyValueIterator method. I experimented with the Exchange class, same 
problems are observed using the SparkSqlSerializer; suggesting the 
UnsafeRowSerializer is probably fine.

I agree with your second comment, I think the code within 
org.apache.spark.unsafe.Platform is OK or we'd be hitting problems elsewhere.

It'll be useful to determine how the values in the assertions can be determined 
programmatically, I think the partitioning algorithm itself is working as 
expected but for some reason stages require more bytes on the platforms I'm 
using.

spark.sql.shuffle.partitions is unchanged, I'm working off the latest master 
code.

Is there something special about the aggregate, join, and complex query 2 tests?

Can we print exactly what the bytes are for each stage? I know rdd.count is 
always correct and the DataFrames are the same (printed each row, written to 
json and parquet - no concerns).

Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate 
test passes.

I'm wondering if there's an extra factor we should take into account when 
determining the indices regardless of platform.

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6918) Secure HBase with Kerberos does not work over YARN

2015-12-09 Thread Pierre Beauvois (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048799#comment-15048799
 ] 

Pierre Beauvois commented on SPARK-6918:


Hi, I also have this issue on my platform.
Here are my specs:
* Spark 1.5.2
* HBase 1.1.2
* Hadoop 2.7.1
* Zookeeper 3.4.5
* Authentication done through Kerberos

I added the option "spark.driver.extraClassPath" to spark-defaults.conf; it 
points to the HBASE_CONF_DIR, as below:
spark.driver.extraClassPath = /opt/application/Hbase/current/conf/

On the driver, I start spark-shell (I'm running it in yarn-client mode) 
{code}
[my_user@uabigspark01 ~]$ spark-shell -v --name HBaseTest --jars 
/opt/application/Hbase/current/lib/hbase-common-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-server-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-client-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-protocol-1.1.2.jar,/opt/application/Hbase/current/lib/protobuf-java-2.5.0.jar,/opt/application/Hbase/current/lib/htrace-core-3.1.0-incubating.jar,/opt/application/Hbase/current/lib/hbase-annotations-1.1.2.jar,/opt/application/Hbase/current/lib/guava-12.0.1.jar
{code}
Then I run the following lines:
{code}
scala> import org.apache.spark._
import org.apache.spark._

scala> import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.rdd.NewHadoopRDD

scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path

scala> import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.util.Bytes

scala> import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.HColumnDescriptor

scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

scala> import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result}
import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result}

scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

scala> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

scala> val conf = HBaseConfiguration.create()
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, 
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, 
hbase-site.xml

scala> conf.addResource(new 
Path("/opt/application/Hbase/current/conf/hbase-site.xml"))

scala> conf.set("hbase.zookeeper.quorum", "FQDN1:2181,FQDN2:2181,FQDN3:2181")

scala> conf.set(TableInputFormat.INPUT_TABLE, "user:noheader")

scala> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], 
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], 
classOf[org.apache.hadoop.hbase.client.Result])
2015-12-09 15:17:58,890 INFO  [main] storage.MemoryStore: 
ensureFreeSpace(266248) called with curMem=0, maxMem=556038881
2015-12-09 15:17:58,892 INFO  [main] storage.MemoryStore: Block broadcast_0 
stored as values in memory (estimated size 260.0 KB, free 530.0 MB)
2015-12-09 15:17:59,196 INFO  [main] storage.MemoryStore: 
ensureFreeSpace(32808) called with curMem=266248, maxMem=556038881
2015-12-09 15:17:59,197 INFO  [main] storage.MemoryStore: Block 
broadcast_0_piece0 stored as bytes in memory (estimated size 32.0 KB, free 
530.0 MB)
2015-12-09 15:17:59,199 INFO  [sparkDriver-akka.actor.default-dispatcher-2] 
storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 
192.168.200.208:60217 (size: 32.0 KB, free: 530.2 MB)
2015-12-09 15:17:59,203 INFO  [main] spark.SparkContext: Created broadcast 0 
from newAPIHadoopRDD at :34
hBaseRDD: 
org.apache.spark.rdd.RDD[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, 
org.apache.hadoop.hbase.client.Result)] = NewHadoopRDD[0] at newAPIHadoopRDD at 
:34

scala> hBaseRDD.count()
2015-12-09 15:18:09,441 INFO  [main] zookeeper.RecoverableZooKeeper: Process 
identifier=hconnection-0x4a52ac6 connecting to ZooKeeper 
ensemble=FQDN1:2181,FQDN2:2181,FQDN3:2181
2015-12-09 15:18:09,456 INFO  [main] zookeeper.ZooKeeper: Client 
environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
2015-12-09 15:18:09,456 INFO  [main] zookeeper.ZooKeeper: Client 
environment:host.name=DRIVER.FQDN
2015-12-09 15:18:09,456 INFO  [main] zookeeper.ZooKeeper: Client 
environment:java.version=1.7.0_85
2015-12-09 15:18:09,456 INFO  [main] zookeeper.ZooKeeper: Client 
environment:java.vendor=Oracle Corporation
2015-12-09 15:18:09,456 INFO  [main] zookeeper.ZooKeeper: Client 
environment:java.home=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85.x86_64/jre
2015-12-09 15:18:09,456 INFO  [main] zookeeper.ZooKeeper: Client 

[jira] [Updated] (SPARK-12188) [SQL] Code refactoring and comment correction in Dataset APIs

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12188:
--
Assignee: Xiao Li

> [SQL] Code refactoring and comment correction in Dataset APIs
> -
>
> Key: SPARK-12188
> URL: https://issues.apache.org/jira/browse/SPARK-12188
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 1.6.0
>
>
> - Created a new private variable `boundTEncoder` that can be shared by 
> multiple functions, `RDD`, `select` and `collect`. 
> - Replaced all the `queryExecution.analyzed` by the function call 
> `logicalPlan`
> - A few API comments are using wrong class names (e.g., `DataFrame`) or 
> parameter names (e.g., `n`)
> - A few API descriptions are wrong. (e.g., `mapPartitions`)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12222) deserialize RoaringBitmap using Kryo serializer throw Buffer underflow exception

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1:
--
Assignee: Fei Wang

> deserialize RoaringBitmap using Kryo serializer throw Buffer underflow 
> exception
> 
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Fei Wang
>Assignee: Fei Wang
> Fix For: 1.6.0, 2.0.0
>
>
> Here are some problems when deserializing RoaringBitmap; see the example below.
> Run this piece of code:
> ```
> import com.esotericsoftware.kryo.io.{Input => KryoInput, Output => KryoOutput}
> import java.io.{DataInput, DataOutput, FileInputStream, FileOutputStream}
> import org.roaringbitmap.RoaringBitmap
> class KryoInputDataInputBridge(input: KryoInput) extends DataInput {
>   override def readLong(): Long = input.readLong()
>   override def readChar(): Char = input.readChar()
>   override def readFloat(): Float = input.readFloat()
>   override def readByte(): Byte = input.readByte()
>   override def readShort(): Short = input.readShort()
>   override def readUTF(): String = input.readString() // readString in 
> kryo does utf8
>   override def readInt(): Int = input.readInt()
>   override def readUnsignedShort(): Int = input.readShortUnsigned()
>   override def skipBytes(n: Int): Int = input.skip(n.toLong).toInt
>   override def readFully(b: Array[Byte]): Unit = input.read(b)
>   override def readFully(b: Array[Byte], off: Int, len: Int): Unit = 
> input.read(b, off, len)
>   override def readLine(): String = throw new 
> UnsupportedOperationException("readLine")
>   override def readBoolean(): Boolean = input.readBoolean()
>   override def readUnsignedByte(): Int = input.readByteUnsigned()
>   override def readDouble(): Double = input.readDouble()
> }
> class KryoOutputDataOutputBridge(output: KryoOutput) extends DataOutput {
>   override def writeFloat(v: Float): Unit = output.writeFloat(v)
>   // There is no "readChars" counterpart, except maybe "readLine", which 
> is not supported
>   override def writeChars(s: String): Unit = throw new 
> UnsupportedOperationException("writeChars")
>   override def writeDouble(v: Double): Unit = output.writeDouble(v)
>   override def writeUTF(s: String): Unit = output.writeString(s) // 
> writeString in kryo does UTF8
>   override def writeShort(v: Int): Unit = output.writeShort(v)
>   override def writeInt(v: Int): Unit = output.writeInt(v)
>   override def writeBoolean(v: Boolean): Unit = output.writeBoolean(v)
>   override def write(b: Int): Unit = output.write(b)
>   override def write(b: Array[Byte]): Unit = output.write(b)
>   override def write(b: Array[Byte], off: Int, len: Int): Unit = 
> output.write(b, off, len)
>   override def writeBytes(s: String): Unit = output.writeString(s)
>   override def writeChar(v: Int): Unit = output.writeChar(v.toChar)
>   override def writeLong(v: Long): Unit = output.writeLong(v)
>   override def writeByte(v: Int): Unit = output.writeByte(v)
> }
> val outStream = new FileOutputStream("D:\\wfserde")
> val output = new KryoOutput(outStream)
> val bitmap = new RoaringBitmap
> bitmap.add(1)
> bitmap.add(3)
> bitmap.add(5)
> bitmap.serialize(new KryoOutputDataOutputBridge(output))
> output.flush()
> output.close()
> val inStream = new FileInputStream("D:\\wfserde")
> val input = new KryoInput(inStream)
> val ret = new RoaringBitmap
> ret.deserialize(new KryoInputDataInputBridge(input))
> println(ret)
> ```
> this will throw `Buffer underflow` error:
> ```
> com.esotericsoftware.kryo.KryoException: Buffer underflow.
>   at com.esotericsoftware.kryo.io.Input.require(Input.java:156)
>   at com.esotericsoftware.kryo.io.Input.skip(Input.java:131)
>   at com.esotericsoftware.kryo.io.Input.skip(Input.java:264)
>   at 
> org.apache.spark.sql.SQLQuerySuite$$anonfun$6$KryoInputDataInputBridge$1.skipBytes
> ```
> After some investigation, I found this is caused by a bug in Kryo's 
> `Input.skip(long count)` (https://github.com/EsotericSoftware/kryo/issues/119), 
> and we call this method in `KryoInputDataInputBridge`.
> So I think we can fix this issue in two ways:
> 1) upgrade the Kryo version to 2.23.0 or 2.24.0, which have fixed this bug 
> (I am not sure the upgrade is safe in Spark, can you check it? @davies )
> 2) bypass Kryo's `Input.skip(long count)` by directly calling 
> another `skip` method in Kryo's 
> Input.java (https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/src/com/esotericsoftware/kryo/io/Input.java#L124),
>  i.e. write the bug-fixed version of `Input.skip(long count)` in 
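A hedged sketch of option 2 above (illustrative, not necessarily the exact patch): replace the bridge's skipBytes with a loop over Kryo's Input.skip(int), which is not affected by the bug in the long overload:

{code}
// Drop-in replacement for KryoInputDataInputBridge.skipBytes from the snippet above.
override def skipBytes(n: Int): Int = {
  var remaining: Long = n
  while (remaining > 0) {
    val skip = Math.min(Integer.MAX_VALUE, remaining).toInt
    input.skip(skip)   // int overload, unaffected by kryo issue #119
    remaining -= skip
  }
  n
}
{code}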

[jira] [Reopened] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-11621:
---

> ORC filter pushdown not working properly after new unhandled filter interface.
> --
>
> Key: SPARK-11621
> URL: https://issues.apache.org/jira/browse/SPARK-11621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
>
> After the new interface for getting rid of predicate-pushed-down filters that 
> are already processed at the datasource level 
> (https://github.com/apache/spark/pull/9399), filters are no longer pushed down 
> for ORC.
> This is because at {{DataSourceStrategy}}, all the filters are treated as 
> unhandled filters.
> Also, since ORC does not support filtering fully record by record but instead 
> returns rough results, the filters for ORC should not go into the unhandled 
> filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12213) Query with only one distinct should not having on expand

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12213:
--
Component/s: SQL

> Query with only one distinct should not having on expand
> 
>
> Key: SPARK-12213
> URL: https://issues.apache.org/jira/browse/SPARK-12213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> Expand will double the number of records and slow down projection and 
> aggregation; it's better to generate a plan without Expand for a query with 
> only one distinct aggregate (for example, ss_max in TPCDS).
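For reference, a minimal example of the kind of single-distinct query this targets (illustrative only; it assumes a sqlContext with a TPCDS store_sales table registered):

{code}
// Only one DISTINCT aggregate appears, so in principle no Expand is needed.
val df = sqlContext.sql(
  """SELECT max(ss_quantity), count(DISTINCT ss_ticket_number)
    |FROM store_sales""".stripMargin)
df.explain(true)   // inspect whether the plan contains an Expand node
{code}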



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11621.
---
   Resolution: Duplicate
Fix Version/s: (was: 1.6.0)

> ORC filter pushdown not working properly after new unhandled filter interface.
> --
>
> Key: SPARK-11621
> URL: https://issues.apache.org/jira/browse/SPARK-11621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
>
> After the new interface for getting rid of predicate-pushed-down filters that 
> are already processed at the datasource level 
> (https://github.com/apache/spark/pull/9399), filters are no longer pushed down 
> for ORC.
> This is because at {{DataSourceStrategy}}, all the filters are treated as 
> unhandled filters.
> Also, since ORC does not support filtering fully record by record but instead 
> returns rough results, the filters for ORC should not go into the unhandled 
> filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10109) NPE when saving Parquet To HDFS

2015-12-09 Thread Paul Zaczkieiwcz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048821#comment-15048821
 ] 

Paul Zaczkieiwcz commented on SPARK-10109:
--

Still seeing this in 1.5.1. My use case is the same. I've got several 
independent spark jobs downloading data and appending to the same partitioned 
parquet directory in HDFS. None of the partitions overlap but the _temporary 
folder has become a bottleneck, only allowing one job to append at a time.

> NPE when saving Parquet To HDFS
> ---
>
> Key: SPARK-10109
> URL: https://issues.apache.org/jira/browse/SPARK-10109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Sparc-ec2, standalone cluster on amazon
>Reporter: Virgil Palanciuc
>
> Very simple code, trying to save a dataframe
> I get this in the driver
> {quote}
> 15/08/19 11:21:41 INFO TaskSetManager: Lost task 9.2 in stage 217.0 (TID 
> 4748) on executor 172.xx.xx.xx: java.lang.NullPointerException (null) 
> and  (not for that task):
> 15/08/19 11:21:46 WARN TaskSetManager: Lost task 5.0 in stage 543.0 (TID 
> 5607, 172.yy.yy.yy): java.lang.NullPointerException
> at 
> parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
> at 
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
> at 
> parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
> at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:88)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer.clearOutputWriters(commands.scala:536)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer.abortTask(commands.scala:552)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:269)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> I get this in the executor log:
> {quote}
> 15/08/19 11:21:41 WARN DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on 
> /gglogs/2015-07-27/_temporary/_attempt_201508191119_0217_m_09_2/dpid=18432/pid=1109/part-r-9-46ac3a79-a95c-4d9c-a2f1-b3ee76f6a46c.snappy.parquet
>  File does not exist. Holder DFSClient_NONMAPREDUCE_1730998114_63 does not 
> have any open files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>   

[jira] [Commented] (SPARK-2388) Streaming from multiple different Kafka topics is problematic

2015-12-09 Thread Dan Dutrow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048885#comment-15048885
 ] 

Dan Dutrow commented on SPARK-2388:
---

The Kafka Direct API lets you insert a callback function that gives you access 
to the topic name and other metadata besides the key.
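For illustration, a hedged sketch of that callback with the Spark 1.x direct API (broker addresses, topics, and offsets below are assumed placeholders):

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TopicAwareDirectStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("topic-aware"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val fromOffsets = Map(
      TopicAndPartition("retarget", 0) -> 0L,
      TopicAndPartition("datapair", 0) -> 0L)

    // The messageHandler sees each record's metadata, so the topic name can
    // travel with the payload and be used to branch per topic downstream.
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      val retarget = rdd.filter(_._1 == "retarget").count()
      val datapair = rdd.filter(_._1 == "datapair").count()
      println(s"retarget: $retarget, datapair: $datapair")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}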

> Streaming from multiple different Kafka topics is problematic
> -
>
> Key: SPARK-2388
> URL: https://issues.apache.org/jira/browse/SPARK-2388
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Sergey
> Fix For: 1.0.1
>
>
> The default way of creating a stream out of a Kafka source would be:
> val stream = KafkaUtils.createStream(ssc,"localhost:2181","logs", 
> Map("retarget" -> 2,"datapair" -> 2))
> However, if the two topics - in this case "retarget" and "datapair" - are very 
> different, there is no way to set up different filters, mapping functions, 
> etc., as they are effectively merged.
> However, the instance of KafkaInputDStream created by this call internally 
> calls ConsumerConnector.createMessageStream(), which returns a *map* of 
> KafkaStreams keyed by topic. It would be great if this map were exposed 
> somehow, so that the aforementioned call 
> val streamS = KafkaUtils.createStreamS(...)
> returned a map of streams.
> Regards,
> Sergey Malov
> Collective Media



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12241) Improve failure reporting in Yarn client obtainTokenForHBase()

2015-12-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12241:


Assignee: Apache Spark

> Improve failure reporting in Yarn client obtainTokenForHBase()
> --
>
> Key: SPARK-12241
> URL: https://issues.apache.org/jira/browse/SPARK-12241
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.5.2
> Environment: Kerberos
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-11265 improved reporting of reflection problems getting the Hive token 
> in a secure cluster.
> The `obtainTokenForHBase()` method needs to pick up the same policy of what 
> exceptions to swallow/reject, and report any swallowed exceptions with the 
> full stack trace.
> Without this, problems authenticating with HBase result in log errors with no 
> meaningful information at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9059) Update Python Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-12-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9059.
--
Resolution: Not A Problem

> Update Python Direct Kafka Word count examples to show the use of 
> HasOffsetRanges
> -
>
> Key: SPARK-9059
> URL: https://issues.apache.org/jira/browse/SPARK-9059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> Update the Python examples of Direct Kafka word count to access the offset 
> ranges using HasOffsetRanges and print them. For example, in Scala:
>  
> {code}
> var offsetRanges: Array[OffsetRange] = _
> ...
> directKafkaDStream.foreachRDD { rdd => 
> offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
> }
> ...
> transformedDStream.foreachRDD { rdd => 
> // some operation
> println("Processed ranges: " + offsetRanges)
> }
> {code}
> See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
> more info, and the master source code for more updated information on python. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12203) Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers using receivers

2015-12-09 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-12203.
---
Resolution: Won't Fix

> Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers 
> using receivers
> ---
>
> Key: SPARK-12203
> URL: https://issues.apache.org/jira/browse/SPARK-12203
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>
> Currently, we have DirectKafkaInputDStream, which directly pulls messages 
> from Kafka brokers without any receivers, and KafkaInputDStream, which pulls 
> messages from a Kafka broker using a receiver with ZooKeeper.
> As we observed, because DirectKafkaInputDStream retrieves messages from Kafka 
> after each batch finishes, it adds latency compared with KafkaInputDStream, 
> which continues to pull messages during each batch window.
> So we propose to add KafkaDirectInputDStream, which directly pulls messages 
> from Kafka brokers like DirectKafkaInputDStream, but uses receivers like 
> KafkaInputDStream and pulls messages during each batch window.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


