date:20151230


[ 
https://issues.apache.org/jira/browse/SPARK-12039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075637#comment-15075637
 ] 

Apache Spark commented on SPARK-12039:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10533

> HiveSparkSubmitSuite's SPARK-9757 Persist Parquet relation with decimal 
> column is very flaky
> 
>
> Key: SPARK-12039
> URL: https://issues.apache.org/jira/browse/SPARK-12039
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/4121/consoleFull
> It frequently fails to download 
> `commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar` in 
> hadoop 1 tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-12590) Inconsistent behavior of randomSplit in YARN mode

2015-12-30 Thread Gaurav Kumar (JIRA)

Gaurav Kumar created SPARK-12590:


 Summary: Inconsistent behavior of randomSplit in YARN mode
 Key: SPARK-12590
 URL: https://issues.apache.org/jira/browse/SPARK-12590
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Affects Versions: 1.5.2
 Environment: YARN mode
Reporter: Gaurav Kumar


I noticed an inconsistent behavior when using rdd.randomSplit when the source 
rdd is repartitioned, but only in YARN mode. It works fine in local mode though.

*Code:*
val rdd = sc.parallelize(1 to 100)
val rdd2 = rdd.repartition(64)
rdd.partitions.size
rdd2.partitions.size
val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
train.takeOrdered(10)
test.takeOrdered(10)

*Master: local*
Both the take statements produce consistent results and have no overlap in 
numbers being outputted.

*Master: YARN*
However, when these are run on YARN mode, these produce random results every 
time and also the train and test have overlap in the numbers being outputted.
If I use rdd.randomSplit, then it works fine even on YARN.

So, it concludes that the repartition is being evaluated every time the 
splitting occurs.

Interestingly, if I cache the rdd2 before splitting it, then we can expect 
consistent behavior since repartition is not evaluated again and again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-7689) Remove TTL-based metadata cleaning (spark.cleaner.ttl)


 [ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7689:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11806

> Remove TTL-based metadata cleaning (spark.cleaner.ttl)
> --
>
> Key: SPARK-7689
> URL: https://issues.apache.org/jira/browse/SPARK-7689
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> With the introduction of ContextCleaner, I think there's no longer any reason 
> for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
> perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
> RDDs or broadcast variables in your REPL history and having them never get 
> cleaned up, although I think this is an uncommon use-case).  I think that 
> this property used to be relevant for Spark Streaming jobs, but I think 
> that's no longer the case since the latest Streaming docs have removed all 
> mentions of {{spark.cleaner.ttl}} (see 
> https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
>  for example).
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
>  for an old, related discussion.  Also, see 
> https://github.com/apache/spark/pull/126, the PR that introduced the new 
> ContextCleaner mechanism.
> For Spark 2.0, I think we should remove {{spark.cleaner.ttl}} and the 
> associated TTL-based metadata cleaning code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch


[ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075678#comment-15075678
 ] 

Reynold Xin commented on SPARK-6166:


[~zsxwing] can you pick this one up?


> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of 
> size.
> But this is not always sufficient : when the number of hosts in the cluster 
> increase, this can lead to very large number of in-bound connections to one 
> more nodes - causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on 
> number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurance of worker failures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-529) Have a single file that controls the environmental variables and spark config options


[ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075696#comment-15075696
 ] 

Apache Spark commented on SPARK-529:


User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10205

> Have a single file that controls the environmental variables and spark config 
> options
> -
>
> Key: SPARK-529
> URL: https://issues.apache.org/jira/browse/SPARK-529
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>
> E.g. multiple places in the code base uses SPARK_MEM and has its own default 
> set to 512. We need a central place to enforce default values as well as 
> documenting the variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-529) Have a single file that controls the environmental variables and spark config options


 [ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-529:
--

Assignee: Apache Spark

> Have a single file that controls the environmental variables and spark config 
> options
> -
>
> Key: SPARK-529
> URL: https://issues.apache.org/jira/browse/SPARK-529
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> E.g. multiple places in the code base uses SPARK_MEM and has its own default 
> set to 512. We need a central place to enforce default values as well as 
> documenting the variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-529) Have a single file that controls the environmental variables and spark config options


 [ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-529:
--

Assignee: (was: Apache Spark)

> Have a single file that controls the environmental variables and spark config 
> options
> -
>
> Key: SPARK-529
> URL: https://issues.apache.org/jira/browse/SPARK-529
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>
> E.g. multiple places in the code base uses SPARK_MEM and has its own default 
> set to 512. We need a central place to enforce default values as well as 
> documenting the variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-12533) hiveContext.table() throws the wrong exception

2015-12-30 Thread Thomas Sebastian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075451#comment-15075451
 ] 

Thomas Sebastian edited comment on SPARK-12533 at 12/31/15 5:52 AM:


Submitted the pull request
https://github.com/apache/spark/pull/10529

Could this be reviewed please?


was (Author: thomastechs):
Submitted the pull request

> hiveContext.table() throws the wrong exception
> --
>
> Key: SPARK-12533
> URL: https://issues.apache.org/jira/browse/SPARK-12533
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> This should throw an {{AnalysisException}} that includes the table name 
> instead of the following:
> {code}
> org.apache.spark.sql.catalyst.analysis.NoSuchTableException
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:122)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:384)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:458)
>   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:458)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:830)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:826)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

2015-12-30 Thread yucai (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

yucai updated SPARK-12196:
--
Description:
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but capacity is small. HDDs have good capacity,
but x2-x3 lower than SSDs.
How can we get both good?

*Solution*
Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup
storage.
When Spark core allocates blocks (either for shuffle or RDD cache), it gets
blocks from SSDs first, and when SSD’s useable space is less than some
threshold, getting blocks from HDDs.

In our implementation, we actually go further. We support a way to build any
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

*Performance*
1. At the best case, our solution performs the same as all SSDs.
2. At the worst case, like all data are spilled to HDDs, no performance
regression.
3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it
could be higher, CPU reaches bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because we
support both RDD cache and shuffle and no extra inter process communication.

*Test Environment*
1. 4 IVB box(40 cores, 192GB memory, 10GB Nic, 11HDDs/11SSDs/PCIE SSD)
2. Real customer case NWeight(graph analysis), which is to compute associations
between two vertices that are n-hop away(e.g., friend-to-friend or
video-to-video relationship for recommendation).
3. Data Size: 22GB, Vertices: 41 milion, Edges: 1.4 billion.

*Usage*
1. Set the priority and threshold for each layer in
spark.storage.hierarchyStore.
{code}
spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
{code}
It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all
the rest form the last layer.

2. Configure each layer's location, user just needs put the keyword like "nvm",
"ssd", which are specified in step 1, into local dirs, like spark.local.dir or
yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

After then, restart your Spark application, it will allocate blocks from nvm
first.
When nvm's usable space is less than 50GB, it starts to allocate from ssd.
When ssd's usable space is less than 80GB, it starts to allocate from the last
layer.

was:
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but capacity is small. HDDs have good capacity,
but x2-x3 lower than SSDs.
How can we get both good?

*Solution*
Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup
storage.
When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets
blocks from SSDs first, and when SSD’s useable space is less than some
threshold, getting blocks from HDDs.

In our implementation, we actually go further. We support a way to build any
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

> Store blocks in different speed storage devices by hierarchy way
>
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
>

[jira] [Resolved] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()


 [ 
https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12585.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10528
[https://github.com/apache/spark/pull/10528]

> The numFields of UnsafeRow should not changed by pointTo()
> --
>
> Key: SPARK-12585
> URL: https://issues.apache.org/jira/browse/SPARK-12585
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes 
> is calculated, making pointTo() a little bit heavy.
> It should be part of constructor of UnsafeRow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-12414) Remove closure serializer


[ 
https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075814#comment-15075814
 ] 

Andrew Or edited comment on SPARK-12414 at 12/31/15 7:51 AM:
-

It's also for code cleanup. Right now SparkEnv has a "closure serializer" and a 
"serializer", which is kind of confusing. We should just use Java serializer 
since it's worked for such a long time. I don't know much about Kryo 3.0 but 
I'm not sure if upgrading would be sufficient.


was (Author: andrewor14):
It's also for code cleanup. Right now SparkEnv has a "closure serializer" and a 
"serializer", which is kind of confusing. We should just use Java serializer 
since it's worked for such a long time. Not sure about Kryo 3.0 but I'm not 
sure if upgrading would be sufficient.

> Remove closure serializer
> -
>
> Key: SPARK-12414
> URL: https://issues.apache.org/jira/browse/SPARK-12414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> There is a config `spark.closure.serializer` that accepts exactly one value: 
> the java serializer. This is because there are currently bugs in the Kryo 
> serializer that make it not a viable candidate. This was uncovered by an 
> unsuccessful attempt to make it work: SPARK-7708.
> My high level point is that the Java serializer has worked well for at least 
> 6 Spark versions now, and it is an incredibly complicated task to get other 
> serializers (not just Kryo) to work with Spark's closures. IMO the effort is 
> not worth it and we should just remove this documentation and all the code 
> associated with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-529) Have a single file that controls the environmental variables and spark config options


 [ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-529:
--

Assignee: (was: Apache Spark)

> Have a single file that controls the environmental variables and spark config 
> options
> -
>
> Key: SPARK-529
> URL: https://issues.apache.org/jira/browse/SPARK-529
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>
> E.g. multiple places in the code base uses SPARK_MEM and has its own default 
> set to 512. We need a central place to enforce default values as well as 
> documenting the variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-529) Have a single file that controls the environmental variables and spark config options


 [ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-529:
--

Assignee: Apache Spark

> Have a single file that controls the environmental variables and spark config 
> options
> -
>
> Key: SPARK-529
> URL: https://issues.apache.org/jira/browse/SPARK-529
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> E.g. multiple places in the code base uses SPARK_MEM and has its own default 
> set to 512. We need a central place to enforce default values as well as 
> documenting the variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12196) Store/retrieve blocks in different speed storage devices by hierarchy way

2015-12-30 Thread yucai (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

yucai updated SPARK-12196:
--
Description:
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but capacity is small.
HDDs have good capacity, but much slower than SSDs(x2-x3 slower than SATA SSD,
x20 slower than PCIe SSD).
How can we get both good?

In our implementation, we actually go further. We support a way to build any
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

was:
*Problem*
Nowadays, users have both SSDs and HDDs.
SSDs have great performance, but capacity is small. HDDs have good capacity,
but x2-x3 lower than SSDs.
How can we get both good?

In our implementation, we actually go further. We support a way to build any
level hierarchy store access all storage medias (NVM, SSD, HDD etc.).

> Store/retrieve blocks in different speed storage devices by hierarchy way
> -
>
> Key: SPARK-12196
>

[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch


[ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075757#comment-15075757
 ] 

Shixiong Zhu commented on SPARK-6166:
-

Sure

> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of 
> size.
> But this is not always sufficient : when the number of hosts in the cluster 
> increase, this can lead to very large number of in-bound connections to one 
> more nodes - causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on 
> number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurance of worker failures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch


 [ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-6166:
---

Assignee: Shixiong Zhu

> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Assignee: Shixiong Zhu
>Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of 
> size.
> But this is not always sufficient : when the number of hosts in the cluster 
> increase, this can lead to very large number of in-bound connections to one 
> more nodes - causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on 
> number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurance of worker failures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12516) Properly handle NM failure situation for Spark on Yarn

2015-12-30 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075693#comment-15075693
 ] 

Saisai Shao commented on SPARK-12516:
-

Hi [~vanzin], what is your suggestion of this issue? I'm failed to figure out a 
proper solution to address this issue.

> Properly handle NM failure situation for Spark on Yarn
> --
>
> Key: SPARK-12516
> URL: https://issues.apache.org/jira/browse/SPARK-12516
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Failure of NodeManager will make all the executors belong to that NM exit 
> silently.
> Currently in the implementation of YarnSchedulerBackend, driver will receive 
> onDisconnect event when executor is lost, which will further ask AM to get 
> the lost reason, AM will hold this query connection until RM report back the 
> status of lost container, and reply back to driver. In the case of NM 
> failure, RM cannot detect this failure immediately until timeout (10 mins by 
> default), so the driver query of lost reason will be timed out (120 seconds), 
> after timed out the executor states in the driver side will be cleaned out, 
> but in the AM side, this states will still be maintained until NM heartbeat 
> timeout. So this will potentially introduce some unexpected behaviors:
> ---
> * In the dynamic allocation disabled situation, executor number in the driver 
> side is less than the number in the AM side after timeout (from 120 seconds 
> to 10 minutes), and cannot be ramped up to the expected number until RM 
> detect the failure of NM and make the related containers as complected.
> {quote}
> For example the target executor number is 10, with 5 NMs (each NM has 2 
> executors). So when 1 NM is failed, 2 related executors are lost. After 
> driver side query timeout, the executor number in driver side is 8, but in AM 
> side it is still 10, so AM will not request additional containers until the 
> number in AM reaches to 8 (after 10 minutes).
> {quote}
> ---
> * When dynamic allocation is enabled, the number of target executor is 
> maintained both in the driver and AM side and synced between them. The target 
> executor number will be correct after driver query timeout (120 seconds), but 
> this number is incorrect in the AM side until NM failure is detected (10 
> minutes). In such case the actual executor number is less than the calculated 
> one.
> {quote}
> For example, current target executor number in driver is N, and in AM side is 
> M, so M - N is the lost number.
> When the executor number needs to ramp up to A, so the actual number will be 
> A - (M - N).
> When the executor number needs to bring down to B, so the actual number will 
> be max(0, B - (M - N)). when the actual number of executors is 0, the whole 
> system is hang, will only be recovered if driver request more resources, or 
> after 10 minutes timeout.
> This can be reproduced by running SparkPi example in the yarn-client mode 
> with follow configurations:
> spark.dynamicAllocation.enabledtrue
> spark.shuffle.service.enabled  true
> spark.dynamicAllocation.minExecutors 1
> spark.dynamicAllocation.initialExecutors 2
> spark.dynamicAllocation.maxExecutors 3
> In the middle of job, killing one NM which only has executors running.
> {quote}
> ---
> Possbile solutions:
> * Sync the actual executor number from the driver to AM after RPC timeout 
> (120 seconds), also clean the related states in the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-529) Have a single file that controls the environmental variables and spark config options


 [ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-529:
-
Assignee: (was: Josh Rosen)

> Have a single file that controls the environmental variables and spark config 
> options
> -
>
> Key: SPARK-529
> URL: https://issues.apache.org/jira/browse/SPARK-529
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>
> E.g. multiple places in the code base uses SPARK_MEM and has its own default 
> set to 512. We need a central place to enforce default values as well as 
> documenting the variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-529) Have a single file that controls the environmental variables and spark config options


 [ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-529:


Assignee: Josh Rosen

> Have a single file that controls the environmental variables and spark config 
> options
> -
>
> Key: SPARK-529
> URL: https://issues.apache.org/jira/browse/SPARK-529
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>
> E.g. multiple places in the code base uses SPARK_MEM and has its own default 
> set to 512. We need a central place to enforce default values as well as 
> documenting the variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-12427) spark builds filling up jenkins' disk


 [ 
https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12427.

Resolution: Fixed

> spark builds filling up jenkins' disk
> -
>
> Key: SPARK-12427
> URL: https://issues.apache.org/jira/browse/SPARK-12427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: shane knapp
>Assignee: Josh Rosen
>Priority: Critical
>  Labels: build, jenkins
> Attachments: graph.png, jenkins_disk_usage.txt
>
>
> problem summary:
> a few spark builds are filling up the jenkins master's disk with millions of 
> little log files as build artifacts.  
> currently, we have a raid10 array set up with 5.4T of storage.  we're 
> currently using 4.0T, 99.9% of which is spark unit test and junit logs.
> the worst offenders, with more than 100G of disk usage per job, are:
> 193G./Spark-1.6-Maven-with-YARN
> 194G./Spark-1.5-Maven-with-YARN
> 205G./Spark-1.6-Maven-pre-YARN
> 216G./Spark-1.5-Maven-pre-YARN
> 387G./Spark-Master-Maven-with-YARN
> 420G./Spark-Master-Maven-pre-YARN
> 520G./Spark-1.6-SBT
> 733G./Spark-1.5-SBT
> 812G./Spark-Master-SBT
> i have attached a full report w/all builds listed as well.
> each of these builds is keeping their build history for 90 days.
> keep in mind that for each new matrix build, we're looking at another 
> 200-500G per for the SBT/pre-YARN/with-YARN jobs.
> a straw man, back of napkin estimate for spark 1.7 is 2T of additional disk 
> usage.
> on the hardware config side, we can move from raid10 to raid 5 and get ~3T 
> additional storage.  if we ditch raid altogether and put in bigger disks, we 
> can get a total of 16-20T storage on master.  another option is to have a NFS 
> mount to a deep storage server.  all of these options will require 
> significant downtime.
> quesitons:
> * can we lower the number of days that we keep build information?
> * there are other options in jenkins that we can set as well:  max number of 
> builds to keep, max # days to keep artifacts, max # of builds to keep 
> w/artifacts
> * can we make the junit and unit test logs smaller (probably not)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12414) Remove closure serializer


[ 
https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075814#comment-15075814
 ] 

Andrew Or commented on SPARK-12414:
---

It's also for code cleanup. Right now SparkEnv has a "closure serializer" and a 
"serializer", which is kind of confusing. We should just use Java serializer 
since it's worked for such a long time. Not sure about Kryo 3.0 but I'm not 
sure if upgrading would be sufficient.

> Remove closure serializer
> -
>
> Key: SPARK-12414
> URL: https://issues.apache.org/jira/browse/SPARK-12414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> There is a config `spark.closure.serializer` that accepts exactly one value: 
> the java serializer. This is because there are currently bugs in the Kryo 
> serializer that make it not a viable candidate. This was uncovered by an 
> unsuccessful attempt to make it work: SPARK-7708.
> My high level point is that the Java serializer has worked well for at least 
> 6 Spark versions now, and it is an incredibly complicated task to get other 
> serializers (not just Kryo) to work with Spark's closures. IMO the effort is 
> not worth it and we should just remove this documentation and all the code 
> associated with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-10524) Decision tree binary classification with ordered categorical features: incorrect centroid

2015-12-30 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10524:
--
Target Version/s: 2.0.0  (was: )

> Decision tree binary classification with ordered categorical features: 
> incorrect centroid
> -
>
> Key: SPARK-10524
> URL: https://issues.apache.org/jira/browse/SPARK-10524
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>
> In DecisionTree and RandomForest binary classification with ordered 
> categorical features, we order categories' bins based on the hard prediction, 
> but we should use the soft prediction.
> Here are the 2 places in mllib and ml:
> * 
> [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L887]
> * 
> [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L779]
> The PR which fixes this should include a unit test which isolates this issue, 
> ideally by directly calling binsToBestSplit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism


[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075271#comment-15075271
 ] 

Reynold Xin commented on SPARK-12537:
-

The point is to be able to support non-standard JSON. Unfortunately real-world 
data isn't always clean and standard conforming.



> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API


[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075230#comment-15075230
 ] 

Mario Briggs commented on SPARK-12177:
--

you could also get just a few of the records you want i.e. not all in 1 shot

override def getNext(): R = {
  if (iter == null || !iter.hasNext) {
iter = consumer.poll(pollTime).iterator()
  }

  if (!iter.hasNext) {
if ( requestOffset < part.untilOffset ) {
   // need to make another poll() and recheck above. So make a 
recursive call i.e. 'return getnext()' here ?
}
finished = true
null.asInstanceOf[R]
  } else {
   ...

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 already released and it introduce new consumer API that not 
> compatible with old one. So, I added new consumer api. I made separate 
> classes in package org.apache.spark.streaming.kafka.v09 with changed API. I 
> didn't remove old classes for more backward compatibility. User will not need 
> to change his old spark applications when he uprgade to new Spark version.
> Please rewiew my changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12407) ClassCast Exception when restarting spark streaming from checkpoint


[ 
https://issues.apache.org/jira/browse/SPARK-12407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075260#comment-15075260
 ] 

Shixiong Zhu commented on SPARK-12407:
--

They have the same problem. Here is a workaround to use them in Streaming: 
https://github.com/apache/spark/pull/10385

> ClassCast Exception when restarting spark streaming from checkpoint
> ---
>
> Key: SPARK-12407
> URL: https://issues.apache.org/jira/browse/SPARK-12407
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Bartlomiej Alberski
>
> I am receiving ClassCast Exception when restarting streaming application from 
> checkpoint:
> {code}
> java.lang.ClassCastException: [B cannot be cast to 
> pl.example.spark.StreamingTestReporter
>   at 
> pl.example.spark.StreamingTest$$anonfun$createContext$1$$anonfun$apply$2$$anonfun$apply$4$$anonfun$apply$5.apply(StreamingTest.scala:38)
>   at 
> pl.example.spark.StreamingTest$$anonfun$createContext$1$$anonfun$apply$2$$anonfun$apply$4$$anonfun$apply$5.apply(StreamingTest.scala:36)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> pl.example.spark.StreamingTest$$anonfun$createContext$1$$anonfun$apply$2$$anonfun$apply$4.apply(StreamingTest.scala:36)
>   at 
> pl.example.spark.StreamingTest$$anonfun$createContext$1$$anonfun$apply$2$$anonfun$apply$4.apply(StreamingTest.scala:33)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> It looks like the problem is connected with instance of class broadcasted to 
> the executors. I think that when restoring from checkpoint Id of broadcasted 
> value changes - this is the reason why read from broadcasted memory block 
> something else (instead of instance of our class). In my production code I 
> received sligthly different exception:
> {code}
> java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration 
> cannot be cast to com.example.sender.MyClassReporter
> {code}
> Below there are links for spark user list with discussion about issue as well 
> as minimal example that helps in reproducing issue.
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-java-lang-ClassCastException-org-apache-spark-util-SerializableConfiguration-on-restt-td25698.html
> https://github.com/alberskib/spark-streaming-broadcast-issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12533) hiveContext.table() throws the wrong exception

2015-12-30 Thread Thomas Sebastian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075278#comment-15075278
 ] 

Thomas Sebastian commented on SPARK-12533:
--

Hi Michael,
I was able to replicate the issue. I am working on the fix.

> hiveContext.table() throws the wrong exception
> --
>
> Key: SPARK-12533
> URL: https://issues.apache.org/jira/browse/SPARK-12533
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> This should throw an {{AnalysisException}} that includes the table name 
> instead of the following:
> {code}
> org.apache.spark.sql.catalyst.analysis.NoSuchTableException
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:122)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:384)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:458)
>   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:458)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:830)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:826)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-30 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075282#comment-15075282
 ] 

Sean Owen commented on SPARK-12537:
---

Of course I appreciate that. Supporting non-standard input sounds helpful, but 
adds its own subtler problems: for example, why is the correct interpretation 
of "John D\oe" the string "John Doe"? JSON says neither is correct.

At the least, the suggestion here seems to be to have a default that's not 
consistent with JSON's spec, Python, or Jackson. The default should be current 
behavior, agreed on by all three and Spark: reject the input.

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()

Davies Liu created SPARK-12585:
--

 Summary: The numFields of UnsafeRow should not changed by pointTo()
 Key: SPARK-12585
 URL: https://issues.apache.org/jira/browse/SPARK-12585
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes is 
calculated, making pointTo() a little bit heavy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12584) How to do the merging of 2 JavaScemaRDD data in spark 1.2.0 version. The similar features we can achieved in spark 1.4.1 using "UNIONALL", but i am unable to find any

2015-12-30 Thread Rijash Patel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075277#comment-15075277
 ] 

Rijash Patel commented on SPARK-12584:
--

Dear Sean,

Sorry to ask this, actually i am new to this forum.
Could you please help me to know, why this has been marked as Invalid?
Is, that we can't achieve JavaSchemaRDD merging in spark 1.2.0? 

> How to do the merging of 2 JavaScemaRDD data in spark 1.2.0 version. The 
> similar features we can achieved in spark 1.4.1 using "UNIONALL", but i am 
> unable to find any transformation for spark 1.2.0.
> --
>
> Key: SPARK-12584
> URL: https://issues.apache.org/jira/browse/SPARK-12584
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Rijash Patel
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API


[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075172#comment-15075172
 ] 

Mario Briggs commented on SPARK-12177:
--

Nikita,

thank you. 

A-C : Looks good to me. (BTW i didn't review changes related to receiver based 
approach, even in earlier round)

D - I think it is OK for KafkaTestUtils to have dependency on core, since that 
is more of our internal test approach (however i havent spent time to think if 
even that can be bettered). To the higher issue, i think Kafka *will* provide 
TopicPartition as serializable, which will make this moot, but good that we 
have tracked it here

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 already released and it introduce new consumer API that not 
> compatible with old one. So, I added new consumer api. I made separate 
> classes in package org.apache.spark.streaming.kafka.v09 with changed API. I 
> didn't remove old classes for more backward compatibility. User will not need 
> to change his old spark applications when he uprgade to new Spark version.
> Please rewiew my changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API


[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075205#comment-15075205
 ] 

Mario Briggs commented on SPARK-12177:
--

Very good point about creation of KafkaConsumer frequently. In fact, Praveen is 
investigating if that is reason the 'position()' method hangs when we have 
batch intervals at 200ms and below.
 So one way to try to optimize it is this way : since the 'compute' method in 
DirectKafkaInputDStream runs in the driver, why not store the 'KafkaConsumer' 
rather than the KafkaCluster as a member variable in this class. Of course we 
will need to mark it transient, so that its not attempted to be serialized and 
that means always check if null and re-initialize if required, before use. The 
only use of the Consumer here is to find the new latest offsets, so we will 
have to massage that method for use with an existing consumer object .
Or another option is to let KafkaCluster have a KafkaConsumer instance as a 
member variable with same noted aspects about being transient.

This also means, move the part about fetching the leader ipAddress for 
getPreferredLocations() away from KafkaRDD.getPartitions() to 
DirectKafkaInputDStream.compute() and have 'leaders' as constructor param to 
KafkaRDD ( i now realize that KafkaRDD is private so we are not having that on 
a public API as i thought earlier)
  

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 already released and it introduce new consumer API that not 
> compatible with old one. So, I added new consumer api. I made separate 
> classes in package org.apache.spark.streaming.kafka.v09 with changed API. I 
> didn't remove old classes for more backward compatibility. User will not need 
> to change his old spark applications when he uprgade to new Spark version.
> Please rewiew my changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API


[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075230#comment-15075230
 ] 

Mario Briggs edited comment on SPARK-12177 at 12/30/15 5:24 PM:


you could also get just a few of the records you want i.e. not all in 1 shot. 
So a gist below

override def getNext(): R = {
  if (iter == null || !iter.hasNext) {
iter = consumer.poll(pollTime).iterator()
  }

  if (!iter.hasNext) {
if ( requestOffset < part.untilOffset ) {

   // need to make another poll() and recheck above. So make a 
recursive call i.e. 'return getnext()' here ?

}
finished = true
null.asInstanceOf[R]
  } else {
   ...


was (Author: mariobriggs):
you could also get just a few of the records you want i.e. not all in 1 shot

override def getNext(): R = {
  if (iter == null || !iter.hasNext) {
iter = consumer.poll(pollTime).iterator()
  }

  if (!iter.hasNext) {
if ( requestOffset < part.untilOffset ) {
   // need to make another poll() and recheck above. So make a 
recursive call i.e. 'return getnext()' here ?
}
finished = true
null.asInstanceOf[R]
  } else {
   ...

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 already released and it introduce new consumer API that not 
> compatible with old one. So, I added new consumer api. I made separate 
> classes in package org.apache.spark.streaming.kafka.v09 with changed API. I 
> didn't remove old classes for more backward compatibility. User will not need 
> to change his old spark applications when he uprgade to new Spark version.
> Please rewiew my changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()


 [ 
https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12585:
---
Description: 
Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes is 
calculated, making pointTo() a little bit heavy.

It should be part of constructor of UnsafeRow.

  was:Right now, numFields will be passed in by pointTo(), then 
bitSetWidthInBytes is calculated, making pointTo() a little bit heavy.


> The numFields of UnsafeRow should not changed by pointTo()
> --
>
> Key: SPARK-12585
> URL: https://issues.apache.org/jira/browse/SPARK-12585
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes 
> is calculated, making pointTo() a little bit heavy.
> It should be part of constructor of UnsafeRow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()


 [ 
https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12585:
--

Assignee: Davies Liu

> The numFields of UnsafeRow should not changed by pointTo()
> --
>
> Key: SPARK-12585
> URL: https://issues.apache.org/jira/browse/SPARK-12585
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes 
> is calculated, making pointTo() a little bit heavy.
> It should be part of constructor of UnsafeRow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism


[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075307#comment-15075307
 ] 

Reynold Xin commented on SPARK-12537:
-

Note that there is really no "standard" to deal with non-standard JSON files. 
Each library does its own thing along the permissiveness spectrum. Python and 
Jackson make different decisions for different settings.

In general I think for big data, we'd want to err on the permissive side, 
because it is fairly annoying to find an error later on and have to rerun the 
pipelines with different settings.




> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12300) Fix schema inferance on local collections


 [ 
https://issues.apache.org/jira/browse/SPARK-12300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12300:
---
Fix Version/s: (was: 1.6.0)
   1.6.1

> Fix schema inferance on local collections
> -
>
> Key: SPARK-12300
> URL: https://issues.apache.org/jira/browse/SPARK-12300
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: holdenk
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> Current schema inferance for local python collections halts as soon as there 
> are no NullTypes. This is different than when we specify a sampling ratio of 
> 1.0 on a distributed collection. This could result in incomplete schema 
> information.
> Repro:
> {code}
> input = [{"a": 1}, {"b": "coffee"}]
> df = sqlContext.createDataFrame(input)
> print df.schema
> {code}
> Discovered while looking at SPARK-2870



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows


 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-12582:
-

Assignee: Andrew Or

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>Assignee: Andrew Or
>
> IndexShuffleBlockResolverSuite fails in my windows develop machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> Root cause is when "afterEach" wants to clean up data, some files are still 
> open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows


 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12582:
--
Component/s: (was: Shuffle)
 Windows
 Tests

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>
> IndexShuffleBlockResolverSuite fails in my windows develop machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> Root cause is when "afterEach" wants to clean up data, some files are still 
> open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows


 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12582:
--
Assignee: (was: Apache Spark)
Target Version/s: 1.6.1, 2.0.0

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>
> IndexShuffleBlockResolverSuite fails in my windows develop machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> Root cause is when "afterEach" wants to clean up data, some files are still 
> open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows


 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12582:
--
Assignee: yucai  (was: Andrew Or)

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>Assignee: yucai
>
> IndexShuffleBlockResolverSuite fails in my windows develop machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> Root cause is when "afterEach" wants to clean up data, some files are still 
> open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows


 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12582:


Assignee: Apache Spark

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>Assignee: Apache Spark
>
> IndexShuffleBlockResolverSuite fails in my windows develop machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> Root cause is when "afterEach" wants to clean up data, some files are still 
> open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12567) Add aes_{encrypt,decrypt} UDFs


 [ 
https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12567:


Assignee: (was: Apache Spark)

> Add aes_{encrypt,decrypt} UDFs
> --
>
> Key: SPARK-12567
> URL: https://issues.apache.org/jira/browse/SPARK-12567
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>
> AES (Advanced Encryption Standard) algorithm.
> Add aes_encrypt and aes_decrypt UDFs.
> Ref:
> [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions]
> [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12567) Add aes_{encrypt,decrypt} UDFs


 [ 
https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12567:


Assignee: Apache Spark

> Add aes_{encrypt,decrypt} UDFs
> --
>
> Key: SPARK-12567
> URL: https://issues.apache.org/jira/browse/SPARK-12567
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>Assignee: Apache Spark
>
> AES (Advanced Encryption Standard) algorithm.
> Add aes_encrypt and aes_decrypt UDFs.
> Ref:
> [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions]
> [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-12587) Make parts of the Spark SQL testing API public

2015-12-30 Thread holdenk (JIRA)

holdenk created SPARK-12587:
---

 Summary: Make parts of the Spark SQL testing API public
 Key: SPARK-12587
 URL: https://issues.apache.org/jira/browse/SPARK-12587
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Reporter: holdenk
Priority: Trivial


See parent JIRA for design doc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-12495) use true as default value for propagateNull in NewInstance

2015-12-30 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12495.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10443
[https://github.com/apache/spark/pull/10443]

> use true as default value for propagateNull in NewInstance
> --
>
> Key: SPARK-12495
> URL: https://issues.apache.org/jira/browse/SPARK-12495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-30 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075318#comment-15075318
 ] 

Sean Owen commented on SPARK-12537:
---

Yeah, by definition, but you've already observed that Python, Jackson and Spark 
do one thing. Since Spark uses Jackson, it's least surprising to follow its 
default. What accepts this non-standard JSON by default and what does it do?

The flip side to your argument is: you can silently corrupt input by making 
this the default. It really needs to be opt in.

Being able to pass through the flag seems fine to me, but I'm strongly against 
changing the default behavior in Spark to not match other known libraries, 
especially given the downside.

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provides the option to choose JSON parser can be enabled to accept 
> quoting of all character or not.
> For example, if JSON file that includes not listed by JSON backslash quoting 
> specification, it returns corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-12300) Fix schema inferance on local collections


 [ 
https://issues.apache.org/jira/browse/SPARK-12300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12300.

   Resolution: Fixed
Fix Version/s: 1.6.0
   2.0.0

Issue resolved by pull request 10275
[https://github.com/apache/spark/pull/10275]

> Fix schema inferance on local collections
> -
>
> Key: SPARK-12300
> URL: https://issues.apache.org/jira/browse/SPARK-12300
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: holdenk
>Priority: Minor
> Fix For: 2.0.0, 1.6.0
>
>
> Current schema inferance for local python collections halts as soon as there 
> are no NullTypes. This is different than when we specify a sampling ratio of 
> 1.0 on a distributed collection. This could result in incomplete schema 
> information.
> Repro:
> {code}
> input = [{"a": 1}, {"b": "coffee"}]
> df = sqlContext.createDataFrame(input)
> print df.schema
> {code}
> Discovered while looking at SPARK-2870



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()


[ 
https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075334#comment-15075334
 ] 

Apache Spark commented on SPARK-12585:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10528

> The numFields of UnsafeRow should not changed by pointTo()
> --
>
> Key: SPARK-12585
> URL: https://issues.apache.org/jira/browse/SPARK-12585
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes 
> is calculated, making pointTo() a little bit heavy.
> It should be part of constructor of UnsafeRow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()


 [ 
https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12585:


Assignee: Apache Spark  (was: Davies Liu)

> The numFields of UnsafeRow should not changed by pointTo()
> --
>
> Key: SPARK-12585
> URL: https://issues.apache.org/jira/browse/SPARK-12585
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes 
> is calculated, making pointTo() a little bit heavy.
> It should be part of constructor of UnsafeRow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-12586) Wrong answer with registerTempTable and union sql query

2015-12-30 Thread shao lo (JIRA)

shao lo created SPARK-12586:
---

 Summary: Wrong answer with registerTempTable and union sql query
 Key: SPARK-12586
 URL: https://issues.apache.org/jira/browse/SPARK-12586
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.2
 Environment: Windows 7
Reporter: shao lo


The following python script gets the wrong answer unless workarounds are used...

from pyspark import SparkContext
from pyspark.sql import SQLContext


if __name__ == "__main__":
sc = SparkContext(appName="PythonSQLbug")
sqlContext = SQLContext(sc)

data = [(v,) for v in range(1, 5)]
values = sqlContext.createDataFrame(data, ['v_value'])
values.registerTempTable("values")
values.show()

data = [
(3, 1, 1, 1, None),
(2, 1, 1, 1, 3),
(3, 2, 1, 1, None),
(3, 3, 1, 1, 2),
(3, 4, 1, 2, None)]
df1 = sqlContext.createDataFrame(data, ['row', 'col', 'foo', 'bar', 
'value'])
df1.registerTempTable("t0")
df1.show()

sql_text = """select row, col, foo, bar, value2 value
from (select row, col, foo, bar, 8 value2 from t0 where row=1 
and col=2) s1
  union select row, col, foo, bar, value from t0 where not 
(row=1 and col=2)"""
df2 = sqlContext.sql(sql_text)
df2.registerTempTable("t1")

# # The following 2 line workaround fixes the problem somehow?
# df3 = sqlContext.createDataFrame(df2.collect())
# df3.registerTempTable("t1")

# # The following 4 line workaround fixes the problem too..but takes way 
longer
# filename = "t1.json"
# df2.write.json(filename, mode='overwrite')
# df3 = sqlContext.read.json(filename)
# df3.registerTempTable("t1")

sql_text2 = """select row, col, v1 value from
(select v1 from
(select v_value v1 from values) s1
  left join
(select value v2,foo,bar,row,col from t1
  where foo=1
and bar=2 and value is not null) s2
  on v1=v2 where v2 is null
) sa join
(select row, col from t1 where foo=1
and bar=2 and value is null) sb"""
result = sqlContext.sql(sql_text2)
result.show()

# Expected result
# +---+---+-+
# |row|col|value|
# +---+---+-+
# |  3|  4|1|
# |  3|  4|2|
# |  3|  4|3|
# |  3|  4|4|
# +---+---+-+

# Getting this wrong result...when not using the workarounds above
# +---+---+-+
# |row|col|value|
# +---+---+-+
# +---+---+-+

sc.stop()




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12554) Standalone mode may hang if max cores is not a multiple of executor cores


 [ 
https://issues.apache.org/jira/browse/SPARK-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12554:
--
Priority: Minor  (was: Major)

> Standalone mode may hang if max cores is not a multiple of executor cores
> -
>
> Key: SPARK-12554
> URL: https://issues.apache.org/jira/browse/SPARK-12554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 1.5.2
>Reporter: Lijie Xu
>Priority: Minor
>
> In scheduleExecutorsOnWorker() in Master.scala,
> {{val keepScheduling = coresToAssign >= minCoresPerExecutor}} should be 
> changed to {{val keepScheduling = coresToAssign > 0}}
> Case 1: 
> Suppose that an app's requested cores is 10 (i.e., {{spark.cores.max = 10}}) 
> and app.coresPerExecutor is 4 (i.e., {{spark.executor.cores = 4}}). 
> After allocating two executors (each has 4 cores) to this app, the 
> {{app.coresToAssign = 2}} and {{minCoresPerExecutor = coresPerExecutor = 4}}, 
> so {{keepScheduling = false}} and no extra executor will be allocated to this 
> app. If {{spark.scheduler.minRegisteredResourcesRatio}} is set to a large 
> number (e.g., > 0.8 in this case), the app will hang and never finish.
> Case 2: if a small app's coresPerExecutor is larger than its requested cores 
> (e.g., {{spark.cores.max = 10}}, {{spark.executor.cores = 16}}), {{val 
> keepScheduling = coresToAssign >= minCoresPerExecutor}} is always FALSE. As a 
> result, this app will never get an executor to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12554) Standalone mode may hang if max cores is not a multiple of executor cores


 [ 
https://issues.apache.org/jira/browse/SPARK-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12554:
--
Summary: Standalone mode may hang if max cores is not a multiple of 
executor cores  (was: Standalone app scheduler will hang when app.coreToAssign 
< minCoresPerExecutor)

> Standalone mode may hang if max cores is not a multiple of executor cores
> -
>
> Key: SPARK-12554
> URL: https://issues.apache.org/jira/browse/SPARK-12554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 1.5.2
>Reporter: Lijie Xu
>
> In scheduleExecutorsOnWorker() in Master.scala,
> {{val keepScheduling = coresToAssign >= minCoresPerExecutor}} should be 
> changed to {{val keepScheduling = coresToAssign > 0}}
> Case 1: 
> Suppose that an app's requested cores is 10 (i.e., {{spark.cores.max = 10}}) 
> and app.coresPerExecutor is 4 (i.e., {{spark.executor.cores = 4}}). 
> After allocating two executors (each has 4 cores) to this app, the 
> {{app.coresToAssign = 2}} and {{minCoresPerExecutor = coresPerExecutor = 4}}, 
> so {{keepScheduling = false}} and no extra executor will be allocated to this 
> app. If {{spark.scheduler.minRegisteredResourcesRatio}} is set to a large 
> number (e.g., > 0.8 in this case), the app will hang and never finish.
> Case 2: if a small app's coresPerExecutor is larger than its requested cores 
> (e.g., {{spark.cores.max = 10}}, {{spark.executor.cores = 16}}), {{val 
> keepScheduling = coresToAssign >= minCoresPerExecutor}} is always FALSE. As a 
> result, this app will never get an executor to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests


 [ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10359:
---
Component/s: Project Infra

> Enumerate Spark's dependencies in a file and diff against it for new pull 
> requests 
> ---
>
> Key: SPARK-10359
> URL: https://issues.apache.org/jira/browse/SPARK-10359
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Sometimes when we have dependency changes it can be pretty unclear what 
> transitive set of things are changing. If we enumerate all of the 
> dependencies and put them in a source file in the repo, we can make it so 
> that it is very explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests


 [ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10359.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10461
[https://github.com/apache/spark/pull/10461]

> Enumerate Spark's dependencies in a file and diff against it for new pull 
> requests 
> ---
>
> Key: SPARK-10359
> URL: https://issues.apache.org/jira/browse/SPARK-10359
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Sometimes when we have dependency changes it can be pretty unclear what 
> transitive set of things are changing. If we enumerate all of the 
> dependencies and put them in a source file in the repo, we can make it so 
> that it is very explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12586) Wrong answer with registerTempTable and union sql query

2015-12-30 Thread shao lo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shao lo updated SPARK-12586:

Attachment: sql_bug.py

Attaching example code

> Wrong answer with registerTempTable and union sql query
> ---
>
> Key: SPARK-12586
> URL: https://issues.apache.org/jira/browse/SPARK-12586
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: Windows 7
>Reporter: shao lo
> Attachments: sql_bug.py
>
>
> The following python script gets the wrong answer unless workarounds are 
> used...
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> if __name__ == "__main__":
> sc = SparkContext(appName="PythonSQLbug")
> sqlContext = SQLContext(sc)
> data = [(v,) for v in range(1, 5)]
> values = sqlContext.createDataFrame(data, ['v_value'])
> values.registerTempTable("values")
> values.show()
> data = [
> (3, 1, 1, 1, None),
> (2, 1, 1, 1, 3),
> (3, 2, 1, 1, None),
> (3, 3, 1, 1, 2),
> (3, 4, 1, 2, None)]
> df1 = sqlContext.createDataFrame(data, ['row', 'col', 'foo', 'bar', 
> 'value'])
> df1.registerTempTable("t0")
> df1.show()
> sql_text = """select row, col, foo, bar, value2 value
> from (select row, col, foo, bar, 8 value2 from t0 where row=1 
> and col=2) s1
>   union select row, col, foo, bar, value from t0 where 
> not (row=1 and col=2)"""
> df2 = sqlContext.sql(sql_text)
> df2.registerTempTable("t1")
> # # The following 2 line workaround fixes the problem somehow?
> # df3 = sqlContext.createDataFrame(df2.collect())
> # df3.registerTempTable("t1")
> # # The following 4 line workaround fixes the problem too..but takes way 
> longer
> # filename = "t1.json"
> # df2.write.json(filename, mode='overwrite')
> # df3 = sqlContext.read.json(filename)
> # df3.registerTempTable("t1")
> sql_text2 = """select row, col, v1 value from
> (select v1 from
> (select v_value v1 from values) s1
>   left join
> (select value v2,foo,bar,row,col from t1
>   where foo=1
> and bar=2 and value is not null) s2
>   on v1=v2 where v2 is null
> ) sa join
> (select row, col from t1 where foo=1
> and bar=2 and value is null) sb"""
> result = sqlContext.sql(sql_text2)
> result.show()
> 
> # Expected result
> # +---+---+-+
> # |row|col|value|
> # +---+---+-+
> # |  3|  4|1|
> # |  3|  4|2|
> # |  3|  4|3|
> # |  3|  4|4|
> # +---+---+-+
> # Getting this wrong result...when not using the workarounds above
> # +---+---+-+
> # |row|col|value|
> # +---+---+-+
> # +---+---+-+
> 
> sc.stop()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12554) Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor


[ 
https://issues.apache.org/jira/browse/SPARK-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075353#comment-15075353
 ] 

Andrew Or commented on SPARK-12554:
---

[~jerrylead] The change you are proposing changes the semantics incorrectly.

The right fix here is to adjust the wait behavior. Right now it doesn't take 
into account the fact that executors can have fixed number of cores, such that 
we never get to `spark.cores.max`. All we need to do is to use the right max 
when waiting for resources, e.g. in your example instead of waiting for all 10 
cores, we wait for the nearest multiple of 4, i.e. 8.

Your case 2 is not at all a bug. The user chose settings that are impossible to 
fulfill. Although there's nothing to fix we can throw an exception to fail the 
application quickly, but no one really runs into this so it's probably not 
worth doing.

> Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor
> --
>
> Key: SPARK-12554
> URL: https://issues.apache.org/jira/browse/SPARK-12554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 1.5.2
>Reporter: Lijie Xu
>
> In scheduleExecutorsOnWorker() in Master.scala,
> {{val keepScheduling = coresToAssign >= minCoresPerExecutor}} should be 
> changed to {{val keepScheduling = coresToAssign > 0}}
> Case 1: 
> Suppose that an app's requested cores is 10 (i.e., {{spark.cores.max = 10}}) 
> and app.coresPerExecutor is 4 (i.e., {{spark.executor.cores = 4}}). 
> After allocating two executors (each has 4 cores) to this app, the 
> {{app.coresToAssign = 2}} and {{minCoresPerExecutor = coresPerExecutor = 4}}, 
> so {{keepScheduling = false}} and no extra executor will be allocated to this 
> app. If {{spark.scheduler.minRegisteredResourcesRatio}} is set to a large 
> number (e.g., > 0.8 in this case), the app will hang and never finish.
> Case 2: if a small app's coresPerExecutor is larger than its requested cores 
> (e.g., {{spark.cores.max = 10}}, {{spark.executor.cores = 16}}), {{val 
> keepScheduling = coresToAssign >= minCoresPerExecutor}} is always FALSE. As a 
> result, this app will never get an executor to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12586) Wrong answer with registerTempTable and union sql query

2015-12-30 Thread shao lo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shao lo updated SPARK-12586:

Description: 
The following sequence of sql(), registerTempTable() calls gets the wrong 
answer.
The correct answer is returned if the temp table is rewritten?

sql_text = """select row, col, foo, bar, value2 value
from (select row, col, foo, bar, 8 value2 from t0 where row=1 
and col=2) s1
  union select row, col, foo, bar, value from t0 where not 
(row=1 and col=2)"""
df2 = sqlContext.sql(sql_text)
df2.registerTempTable("t1")

# # The following 2 line workaround fixes the problem somehow?
# df3 = sqlContext.createDataFrame(df2.collect())
# df3.registerTempTable("t1")

# # The following 4 line workaround fixes the problem too..but takes way 
longer
# filename = "t1.json"
# df2.write.json(filename, mode='overwrite')
# df3 = sqlContext.read.json(filename)
# df3.registerTempTable("t1")

sql_text2 = """select row, col, v1 value from
(select v1 from
(select v_value v1 from values) s1
  left join
(select value v2,foo,bar,row,col from t1
  where foo=1
and bar=2 and value is not null) s2
  on v1=v2 where v2 is null
) sa join
(select row, col from t1 where foo=1
and bar=2 and value is null) sb"""
result = sqlContext.sql(sql_text2)
result.show()

# Expected result
# +---+---+-+
# |row|col|value|
# +---+---+-+
# |  3|  4|1|
# |  3|  4|2|
# |  3|  4|3|
# |  3|  4|4|
# +---+---+-+

# Getting this wrong result...when not using the workarounds above
# +---+---+-+
# |row|col|value|
# +---+---+-+
# +---+---+-+


  was:
The following python script gets the wrong answer unless workarounds are used...

from pyspark import SparkContext
from pyspark.sql import SQLContext


if __name__ == "__main__":
sc = SparkContext(appName="PythonSQLbug")
sqlContext = SQLContext(sc)

data = [(v,) for v in range(1, 5)]
values = sqlContext.createDataFrame(data, ['v_value'])
values.registerTempTable("values")
values.show()

data = [
(3, 1, 1, 1, None),
(2, 1, 1, 1, 3),
(3, 2, 1, 1, None),
(3, 3, 1, 1, 2),
(3, 4, 1, 2, None)]
df1 = sqlContext.createDataFrame(data, ['row', 'col', 'foo', 'bar', 
'value'])
df1.registerTempTable("t0")
df1.show()

sql_text = """select row, col, foo, bar, value2 value
from (select row, col, foo, bar, 8 value2 from t0 where row=1 
and col=2) s1
  union select row, col, foo, bar, value from t0 where not 
(row=1 and col=2)"""
df2 = sqlContext.sql(sql_text)
df2.registerTempTable("t1")

# # The following 2 line workaround fixes the problem somehow?
# df3 = sqlContext.createDataFrame(df2.collect())
# df3.registerTempTable("t1")

# # The following 4 line workaround fixes the problem too..but takes way 
longer
# filename = "t1.json"
# df2.write.json(filename, mode='overwrite')
# df3 = sqlContext.read.json(filename)
# df3.registerTempTable("t1")

sql_text2 = """select row, col, v1 value from
(select v1 from
(select v_value v1 from values) s1
  left join
(select value v2,foo,bar,row,col from t1
  where foo=1
and bar=2 and value is not null) s2
  on v1=v2 where v2 is null
) sa join
(select row, col from t1 where foo=1
and bar=2 and value is null) sb"""
result = sqlContext.sql(sql_text2)
result.show()

# Expected result
# +---+---+-+
# |row|col|value|
# +---+---+-+
# |  3|  4|1|
# |  3|  4|2|
# |  3|  4|3|
# |  3|  4|4|
# +---+---+-+

# Getting this wrong result...when not using the workarounds above
# +---+---+-+
# |row|col|value|
# +---+---+-+
# +---+---+-+

sc.stop()



> Wrong answer with registerTempTable and union sql query
> ---
>
> Key: SPARK-12586
> URL: https://issues.apache.org/jira/browse/SPARK-12586
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: Windows 7
>Reporter: shao lo
> Attachments: sql_bug.py
>
>
> The following sequence of sql(), registerTempTable() calls gets the wrong 
> answer.
> The correct answer is returned if the temp table is rewritten?
> sql_text = """select row, col, foo, bar, value2 value
> from (select row, col, foo, bar, 8 value2 from t0 where

[jira] [Commented] (SPARK-7689) Deprecate spark.cleaner.ttl


[ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075527#comment-15075527
 ] 

Josh Rosen commented on SPARK-7689:
---

See also: 
https://issues.apache.org/jira/browse/SPARK-5594?focusedCommentId=14485780=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14485780

> Deprecate spark.cleaner.ttl
> ---
>
> Key: SPARK-7689
> URL: https://issues.apache.org/jira/browse/SPARK-7689
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>
> With the introduction of ContextCleaner, I think there's no longer any reason 
> for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
> perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
> RDDs or broadcast variables in your REPL history and having them never get 
> cleaned up, although I think this is an uncommon use-case).  I think that 
> this property used to be relevant for Spark Streaming jobs, but I think 
> that's no longer the case since the latest Streaming docs have removed all 
> mentions of {{spark.cleaner.ttl}} (see 
> https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
>  for example).
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
>  for an old, related discussion.  Also, see 
> https://github.com/apache/spark/pull/126, the PR that introduced the new 
> ContextCleaner mechanism.
> We should probably add a deprecation warning to {{spark.cleaner.ttl}} that 
> advises users against using it, since it's an unsafe configuration option 
> that can lead to confusing behavior if it's misused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12561) Remove JobLogger


 [ 
https://issues.apache.org/jira/browse/SPARK-12561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12561:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove JobLogger
> 
>
> Key: SPARK-12561
> URL: https://issues.apache.org/jira/browse/SPARK-12561
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> It was research code and has been deprecated since 1.0.0. No one really uses 
> it since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12561) Remove JobLogger


 [ 
https://issues.apache.org/jira/browse/SPARK-12561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12561:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove JobLogger
> 
>
> Key: SPARK-12561
> URL: https://issues.apache.org/jira/browse/SPARK-12561
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Reynold Xin
>
> It was research code and has been deprecated since 1.0.0. No one really uses 
> it since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-7689) Deprecate spark.cleaner.ttl


 [ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7689:
--
Target Version/s: 2.0.0  (was: 1.5.0)

> Deprecate spark.cleaner.ttl
> ---
>
> Key: SPARK-7689
> URL: https://issues.apache.org/jira/browse/SPARK-7689
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> With the introduction of ContextCleaner, I think there's no longer any reason 
> for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
> perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
> RDDs or broadcast variables in your REPL history and having them never get 
> cleaned up, although I think this is an uncommon use-case).  I think that 
> this property used to be relevant for Spark Streaming jobs, but I think 
> that's no longer the case since the latest Streaming docs have removed all 
> mentions of {{spark.cleaner.ttl}} (see 
> https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
>  for example).
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
>  for an old, related discussion.  Also, see 
> https://github.com/apache/spark/pull/126, the PR that introduced the new 
> ContextCleaner mechanism.
> We should probably add a deprecation warning to {{spark.cleaner.ttl}} that 
> advises users against using it, since it's an unsafe configuration option 
> that can lead to confusing behavior if it's misused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-7689) Deprecate spark.cleaner.ttl


 [ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-7689:
-

Assignee: Josh Rosen

> Deprecate spark.cleaner.ttl
> ---
>
> Key: SPARK-7689
> URL: https://issues.apache.org/jira/browse/SPARK-7689
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> With the introduction of ContextCleaner, I think there's no longer any reason 
> for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
> perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
> RDDs or broadcast variables in your REPL history and having them never get 
> cleaned up, although I think this is an uncommon use-case).  I think that 
> this property used to be relevant for Spark Streaming jobs, but I think 
> that's no longer the case since the latest Streaming docs have removed all 
> mentions of {{spark.cleaner.ttl}} (see 
> https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
>  for example).
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
>  for an old, related discussion.  Also, see 
> https://github.com/apache/spark/pull/126, the PR that introduced the new 
> ContextCleaner mechanism.
> We should probably add a deprecation warning to {{spark.cleaner.ttl}} that 
> advises users against using it, since it's an unsafe configuration option 
> that can lead to confusing behavior if it's misused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Reopened] (SPARK-7689) Deprecate spark.cleaner.ttl


 [ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reopened SPARK-7689:
---

> Deprecate spark.cleaner.ttl
> ---
>
> Key: SPARK-7689
> URL: https://issues.apache.org/jira/browse/SPARK-7689
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> With the introduction of ContextCleaner, I think there's no longer any reason 
> for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
> perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
> RDDs or broadcast variables in your REPL history and having them never get 
> cleaned up, although I think this is an uncommon use-case).  I think that 
> this property used to be relevant for Spark Streaming jobs, but I think 
> that's no longer the case since the latest Streaming docs have removed all 
> mentions of {{spark.cleaner.ttl}} (see 
> https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
>  for example).
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
>  for an old, related discussion.  Also, see 
> https://github.com/apache/spark/pull/126, the PR that introduced the new 
> ContextCleaner mechanism.
> We should probably add a deprecation warning to {{spark.cleaner.ttl}} that 
> advises users against using it, since it's an unsafe configuration option 
> that can lead to confusing behavior if it's misused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-7689) Remove TTL-based metadata cleaning (spark.cleaner.ttl)


 [ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7689:
--
Summary: Remove TTL-based metadata cleaning (spark.cleaner.ttl)  (was: 
Deprecate spark.cleaner.ttl)

> Remove TTL-based metadata cleaning (spark.cleaner.ttl)
> --
>
> Key: SPARK-7689
> URL: https://issues.apache.org/jira/browse/SPARK-7689
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> With the introduction of ContextCleaner, I think there's no longer any reason 
> for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
> perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
> RDDs or broadcast variables in your REPL history and having them never get 
> cleaned up, although I think this is an uncommon use-case).  I think that 
> this property used to be relevant for Spark Streaming jobs, but I think 
> that's no longer the case since the latest Streaming docs have removed all 
> mentions of {{spark.cleaner.ttl}} (see 
> https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
>  for example).
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
>  for an old, related discussion.  Also, see 
> https://github.com/apache/spark/pull/126, the PR that introduced the new 
> ContextCleaner mechanism.
> We should probably add a deprecation warning to {{spark.cleaner.ttl}} that 
> advises users against using it, since it's an unsafe configuration option 
> that can lead to confusing behavior if it's misused.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-7689) Remove TTL-based metadata cleaning (spark.cleaner.ttl)


 [ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7689:
--
Description: 
With the introduction of ContextCleaner, I think there's no longer any reason 
for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
RDDs or broadcast variables in your REPL history and having them never get 
cleaned up, although I think this is an uncommon use-case).  I think that this 
property used to be relevant for Spark Streaming jobs, but I think that's no 
longer the case since the latest Streaming docs have removed all mentions of 
{{spark.cleaner.ttl}} (see 
https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
 for example).

See 
http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
 for an old, related discussion.  Also, see 
https://github.com/apache/spark/pull/126, the PR that introduced the new 
ContextCleaner mechanism.

For Spark 2.0, I think we should remove {{spark.cleaner.ttl}} and the 
associated TTL-based metadata cleaning code.

  was:
With the introduction of ContextCleaner, I think there's no longer any reason 
for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
RDDs or broadcast variables in your REPL history and having them never get 
cleaned up, although I think this is an uncommon use-case).  I think that this 
property used to be relevant for Spark Streaming jobs, but I think that's no 
longer the case since the latest Streaming docs have removed all mentions of 
{{spark.cleaner.ttl}} (see 
https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
 for example).

See 
http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
 for an old, related discussion.  Also, see 
https://github.com/apache/spark/pull/126, the PR that introduced the new 
ContextCleaner mechanism.

We should probably add a deprecation warning to {{spark.cleaner.ttl}} that 
advises users against using it, since it's an unsafe configuration option that 
can lead to confusing behavior if it's misused.


> Remove TTL-based metadata cleaning (spark.cleaner.ttl)
> --
>
> Key: SPARK-7689
> URL: https://issues.apache.org/jira/browse/SPARK-7689
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> With the introduction of ContextCleaner, I think there's no longer any reason 
> for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
> perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
> RDDs or broadcast variables in your REPL history and having them never get 
> cleaned up, although I think this is an uncommon use-case).  I think that 
> this property used to be relevant for Spark Streaming jobs, but I think 
> that's no longer the case since the latest Streaming docs have removed all 
> mentions of {{spark.cleaner.ttl}} (see 
> https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
>  for example).
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
>  for an old, related discussion.  Also, see 
> https://github.com/apache/spark/pull/126, the PR that introduced the new 
> ContextCleaner mechanism.
> For Spark 2.0, I think we should remove {{spark.cleaner.ttl}} and the 
> associated TTL-based metadata cleaning code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-7689) Remove TTL-based metadata cleaning (spark.cleaner.ttl)


[ 
https://issues.apache.org/jira/browse/SPARK-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075573#comment-15075573
 ] 

Josh Rosen commented on SPARK-7689:
---

Note that the periodic GC was implemented for Spark 1.6.

> Remove TTL-based metadata cleaning (spark.cleaner.ttl)
> --
>
> Key: SPARK-7689
> URL: https://issues.apache.org/jira/browse/SPARK-7689
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> With the introduction of ContextCleaner, I think there's no longer any reason 
> for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
> perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
> RDDs or broadcast variables in your REPL history and having them never get 
> cleaned up, although I think this is an uncommon use-case).  I think that 
> this property used to be relevant for Spark Streaming jobs, but I think 
> that's no longer the case since the latest Streaming docs have removed all 
> mentions of {{spark.cleaner.ttl}} (see 
> https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
>  for example).
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
>  for an old, related discussion.  Also, see 
> https://github.com/apache/spark/pull/126, the PR that introduced the new 
> ContextCleaner mechanism.
> For Spark 2.0, I think we should remove {{spark.cleaner.ttl}} and the 
> associated TTL-based metadata cleaning code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-9757) Can't create persistent data source tables with decimal


[ 
https://issues.apache.org/jira/browse/SPARK-9757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075575#comment-15075575
 ] 

Apache Spark commented on SPARK-9757:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10533

> Can't create persistent data source tables with decimal
> ---
>
> Key: SPARK-9757
> URL: https://issues.apache.org/jira/browse/SPARK-9757
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.5.0
>
>
> {{ParquetHiveSerDe}} in Hive versions < 1.2.0 doesn't support decimal. 
> Persisting Parquet relations to metastore of such versions (say 0.13.1) 
> throws the following exception after SPARK-6923.
> {code}
> Caused by: java.lang.UnsupportedOperationException: Parquet does not support 
> decimal. See HIVE-6384
>   at 
> org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:102)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.(ArrayWritableObjectInspector.java:60)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
>   at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
>   at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:576)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:358)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:356)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:356)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:356)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:351)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:198)
>   at 
> org.apache.spark.sql.hive.execution.CreateMetastoreDataSource.run(commands.scala:152)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12481) Remove usage of Hadoop deprecated APIs and reflection that supported 1.x


 [ 
https://issues.apache.org/jira/browse/SPARK-12481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12481:

Assignee: Sean Owen  (was: Apache Spark)

> Remove usage of Hadoop deprecated APIs and reflection that supported 1.x
> 
>
> Key: SPARK-12481
> URL: https://issues.apache.org/jira/browse/SPARK-12481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Streaming
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> Many API calls that were deprecated as of Hadoop 2.2 can be fixed now to use 
> the non-deprecated methods. Also, some reflection-based acrobatics to support 
> 2.x and 1.x can be removed now too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-12588) Remove HTTPBroadcast

Josh Rosen created SPARK-12588:
--

 Summary: Remove HTTPBroadcast
 Key: SPARK-12588
 URL: https://issues.apache.org/jira/browse/SPARK-12588
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen


For Spark 2.0, we should remove HTTPBroadcast and standardize on 
TorrentBroadcast as the only broadcast implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Reopened] (SPARK-529) Have a single file that controls the environmental variables and spark config options

2015-12-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-529:
--

I'm reopening this since I believe it's a worthy addition; in fact, Spark SQL 
already has something similar, and I'm just refactoring that code a little for 
use in the other modules.

(It's not a single file per-se, but the spirit is the same - one location where 
a particular config option is defined - name, type and default value).

> Have a single file that controls the environmental variables and spark config 
> options
> -
>
> Key: SPARK-529
> URL: https://issues.apache.org/jira/browse/SPARK-529
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>
> E.g. multiple places in the code base uses SPARK_MEM and has its own default 
> set to 512. We need a central place to enforce default values as well as 
> documenting the variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12561) Remove JobLogger


[ 
https://issues.apache.org/jira/browse/SPARK-12561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075499#comment-15075499
 ] 

Apache Spark commented on SPARK-12561:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10530

> Remove JobLogger
> 
>
> Key: SPARK-12561
> URL: https://issues.apache.org/jira/browse/SPARK-12561
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> It was research code and has been deprecated since 1.0.0. No one really uses 
> it since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12561) Remove JobLogger


 [ 
https://issues.apache.org/jira/browse/SPARK-12561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12561:


Assignee: Andrew Or  (was: Apache Spark)

> Remove JobLogger
> 
>
> Key: SPARK-12561
> URL: https://issues.apache.org/jira/browse/SPARK-12561
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> It was research code and has been deprecated since 1.0.0. No one really uses 
> it since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12561) Remove JobLogger


 [ 
https://issues.apache.org/jira/browse/SPARK-12561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12561:


Assignee: Apache Spark  (was: Andrew Or)

> Remove JobLogger
> 
>
> Key: SPARK-12561
> URL: https://issues.apache.org/jira/browse/SPARK-12561
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> It was research code and has been deprecated since 1.0.0. No one really uses 
> it since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-11806) Spark 2.0 deprecations and removals


 [ 
https://issues.apache.org/jira/browse/SPARK-11806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11806:

Labels: releasenotes  (was: )

> Spark 2.0 deprecations and removals
> ---
>
> Key: SPARK-11806
> URL: https://issues.apache.org/jira/browse/SPARK-11806
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: releasenotes
>
> This is an umbrella ticket to track things we are deprecating and removing in 
> Spark 2.0.
> All sub-tasks are currently assigned to Reynold to prevent others from 
> picking up prematurely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12588) Remove HTTPBroadcast


 [ 
https://issues.apache.org/jira/browse/SPARK-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12588:

Labels: releasenotes  (was: )

> Remove HTTPBroadcast
> 
>
> Key: SPARK-12588
> URL: https://issues.apache.org/jira/browse/SPARK-12588
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>  Labels: releasenotes
>
> For Spark 2.0, we should remove HTTPBroadcast and standardize on 
> TorrentBroadcast as the only broadcast implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame


 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12148:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-12169

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12588) Remove HTTPBroadcast


 [ 
https://issues.apache.org/jira/browse/SPARK-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12588:

Description: We switched to TorrentBroadcast in Spark 1.1, and 
HttpBroadcast has been undocumented since then. It's time to remove it in Spark 
2.0.For Spark 2.0, we should remove HTTPBroadcast and standardize on 
TorrentBroadcast as the only broadcast implementation.  (was: For Spark 2.0, we 
should remove HTTPBroadcast and standardize on TorrentBroadcast as the only 
broadcast implementation.)

> Remove HTTPBroadcast
> 
>
> Key: SPARK-12588
> URL: https://issues.apache.org/jira/browse/SPARK-12588
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>  Labels: releasenotes
>
> We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been 
> undocumented since then. It's time to remove it in Spark 2.0.For Spark 2.0, 
> we should remove HTTPBroadcast and standardize on TorrentBroadcast as the 
> only broadcast implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12561) Remove JobLogger


 [ 
https://issues.apache.org/jira/browse/SPARK-12561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12561:


Assignee: Apache Spark  (was: Andrew Or)

> Remove JobLogger
> 
>
> Key: SPARK-12561
> URL: https://issues.apache.org/jira/browse/SPARK-12561
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> It was research code and has been deprecated since 1.0.0. No one really uses 
> it since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12588) Remove HTTPBroadcast


 [ 
https://issues.apache.org/jira/browse/SPARK-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12588:


Assignee: Apache Spark  (was: Josh Rosen)

> Remove HTTPBroadcast
> 
>
> Key: SPARK-12588
> URL: https://issues.apache.org/jira/browse/SPARK-12588
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Apache Spark
>  Labels: releasenotes
>
> We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been 
> undocumented since then. It's time to remove it in Spark 2.0.For Spark 2.0, 
> we should remove HTTPBroadcast and standardize on TorrentBroadcast as the 
> only broadcast implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12588) Remove HTTPBroadcast


 [ 
https://issues.apache.org/jira/browse/SPARK-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-12588:
---

Assignee: Reynold Xin  (was: Apache Spark)

> Remove HTTPBroadcast
> 
>
> Key: SPARK-12588
> URL: https://issues.apache.org/jira/browse/SPARK-12588
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Reynold Xin
>  Labels: releasenotes
>
> We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been 
> undocumented since then. It's time to remove it in Spark 2.0.For Spark 2.0, 
> we should remove HTTPBroadcast and standardize on TorrentBroadcast as the 
> only broadcast implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12561) Remove JobLogger


 [ 
https://issues.apache.org/jira/browse/SPARK-12561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-12561:
---

Assignee: Reynold Xin  (was: Apache Spark)

> Remove JobLogger
> 
>
> Key: SPARK-12561
> URL: https://issues.apache.org/jira/browse/SPARK-12561
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Reynold Xin
>
> It was research code and has been deprecated since 1.0.0. No one really uses 
> it since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12588) Remove HTTPBroadcast


[ 
https://issues.apache.org/jira/browse/SPARK-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075507#comment-15075507
 ] 

Apache Spark commented on SPARK-12588:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10531

> Remove HTTPBroadcast
> 
>
> Key: SPARK-12588
> URL: https://issues.apache.org/jira/browse/SPARK-12588
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Reynold Xin
>  Labels: releasenotes
>
> We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been 
> undocumented since then. It's time to remove it in Spark 2.0.For Spark 2.0, 
> we should remove HTTPBroadcast and standardize on TorrentBroadcast as the 
> only broadcast implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12206) Streaming WebUI shows incorrect batch statistics when using Window operations


[ 
https://issues.apache.org/jira/browse/SPARK-12206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075508#comment-15075508
 ] 

Shixiong Zhu commented on SPARK-12206:
--

Because only batches that are multiple times of slide duration will run, 
"StreamingJobProgressListener" cannot get info of other batches.

/cc [~tdas] what do you think about this one?

> Streaming WebUI shows incorrect batch statistics when using Window operations
> -
>
> Key: SPARK-12206
> URL: https://issues.apache.org/jira/browse/SPARK-12206
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Anand Iyer
>Priority: Minor
> Attachments: streaming-webui.png
>
>
> I have a streaming app that uses the Window(...) function to create a sliding 
> window, and perform transformations on the window'd DStream.
> The Batch statistics section of the Streaming UI starts displaying stats for 
> each Window, instead of each micro-batch. Is that expected behavior?
> The "Input Size" column shows incorrect values. The streaming application is 
> receiving about 1K events/sec. However, the "Input Size" column shows values 
> in the single digits or low double digits. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12588) Remove HTTPBroadcast


 [ 
https://issues.apache.org/jira/browse/SPARK-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12588:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove HTTPBroadcast
> 
>
> Key: SPARK-12588
> URL: https://issues.apache.org/jira/browse/SPARK-12588
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Apache Spark
>  Labels: releasenotes
>
> We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been 
> undocumented since then. It's time to remove it in Spark 2.0.For Spark 2.0, 
> we should remove HTTPBroadcast and standardize on TorrentBroadcast as the 
> only broadcast implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12588) Remove HTTPBroadcast


 [ 
https://issues.apache.org/jira/browse/SPARK-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12588:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove HTTPBroadcast
> 
>
> Key: SPARK-12588
> URL: https://issues.apache.org/jira/browse/SPARK-12588
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Reynold Xin
>  Labels: releasenotes
>
> We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been 
> undocumented since then. It's time to remove it in Spark 2.0.For Spark 2.0, 
> we should remove HTTPBroadcast and standardize on TorrentBroadcast as the 
> only broadcast implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3873) Scala style: check import ordering

2015-12-30 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075511#comment-15075511
 ] 

Marcelo Vanzin commented on SPARK-3873:
---

The style checker is in but only generates warnings now; I'll leave the bug 
open so we can clean up the source base before enabling errors for import order 
violations.

> Scala style: check import ordering
> --
>
> Key: SPARK-3873
> URL: https://issues.apache.org/jira/browse/SPARK-3873
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12561) Remove JobLogger


 [ 
https://issues.apache.org/jira/browse/SPARK-12561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12561:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove JobLogger
> 
>
> Key: SPARK-12561
> URL: https://issues.apache.org/jira/browse/SPARK-12561
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> It was research code and has been deprecated since 1.0.0. No one really uses 
> it since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12561) Remove JobLogger


 [ 
https://issues.apache.org/jira/browse/SPARK-12561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12561:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove JobLogger
> 
>
> Key: SPARK-12561
> URL: https://issues.apache.org/jira/browse/SPARK-12561
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Reynold Xin
>
> It was research code and has been deprecated since 1.0.0. No one really uses 
> it since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows


 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12582:


Assignee: (was: Apache Spark)

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Reporter: yucai
>
> IndexShuffleBlockResolverSuite fails in my windows develop machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> Root cause is when "afterEach" wants to clean up data, some files are still 
> open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows


 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12582:


Assignee: Apache Spark

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Reporter: yucai
>Assignee: Apache Spark
>
> IndexShuffleBlockResolverSuite fails in my windows develop machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> Root cause is when "afterEach" wants to clean up data, some files are still 
> open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12427) spark builds filling up jenkins' disk


[ 
https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075413#comment-15075413
 ] 

Josh Rosen commented on SPARK-12427:


I did a significant amount of cleanup last week and it looks like we have 
plenty of free space. Thoughts on closing this ticket?

> spark builds filling up jenkins' disk
> -
>
> Key: SPARK-12427
> URL: https://issues.apache.org/jira/browse/SPARK-12427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: shane knapp
>Priority: Critical
>  Labels: build, jenkins
> Attachments: graph.png, jenkins_disk_usage.txt
>
>
> problem summary:
> a few spark builds are filling up the jenkins master's disk with millions of 
> little log files as build artifacts.  
> currently, we have a raid10 array set up with 5.4T of storage.  we're 
> currently using 4.0T, 99.9% of which is spark unit test and junit logs.
> the worst offenders, with more than 100G of disk usage per job, are:
> 193G./Spark-1.6-Maven-with-YARN
> 194G./Spark-1.5-Maven-with-YARN
> 205G./Spark-1.6-Maven-pre-YARN
> 216G./Spark-1.5-Maven-pre-YARN
> 387G./Spark-Master-Maven-with-YARN
> 420G./Spark-Master-Maven-pre-YARN
> 520G./Spark-1.6-SBT
> 733G./Spark-1.5-SBT
> 812G./Spark-Master-SBT
> i have attached a full report w/all builds listed as well.
> each of these builds is keeping their build history for 90 days.
> keep in mind that for each new matrix build, we're looking at another 
> 200-500G per for the SBT/pre-YARN/with-YARN jobs.
> a straw man, back of napkin estimate for spark 1.7 is 2T of additional disk 
> usage.
> on the hardware config side, we can move from raid10 to raid 5 and get ~3T 
> additional storage.  if we ditch raid altogether and put in bigger disks, we 
> can get a total of 16-20T storage on master.  another option is to have a NFS 
> mount to a deep storage server.  all of these options will require 
> significant downtime.
> quesitons:
> * can we lower the number of days that we keep build information?
> * there are other options in jenkins that we can set as well:  max number of 
> builds to keep, max # days to keep artifacts, max # of builds to keep 
> w/artifacts
> * can we make the junit and unit test logs smaller (probably not)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-6155) Support latest Scala (2.11.6+)


[ 
https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075418#comment-15075418
 ] 

Josh Rosen commented on SPARK-6155:
---

Closing as "Cannot Reproduce" since this seems to only affect old versions of 
Spark. Please re-open / re-file if this is still an issue.

> Support latest Scala (2.11.6+)
> --
>
> Key: SPARK-6155
> URL: https://issues.apache.org/jira/browse/SPARK-6155
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Jianshi Huang
>
> Just tried to build with Scala 2.11.5. failed with following error message:
> [INFO] Compiling 9 Scala sources to 
> /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes...
> [ERROR] 
> /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132:
>  value withIncompleteHandler is not a member of 
> SparkIMain.this.global.PerRunReporting
> [ERROR]   currentRun.reporting.withIncompleteHandler((_, _) => 
> isIncomplete = true) {
> [ERROR]^
> Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5
> http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-6155) Support latest Scala (2.11.6+)


 [ 
https://issues.apache.org/jira/browse/SPARK-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen closed SPARK-6155.
-
Resolution: Cannot Reproduce

> Support latest Scala (2.11.6+)
> --
>
> Key: SPARK-6155
> URL: https://issues.apache.org/jira/browse/SPARK-6155
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Jianshi Huang
>
> Just tried to build with Scala 2.11.5. failed with following error message:
> [INFO] Compiling 9 Scala sources to 
> /Users/jianshuang/workspace/others/spark/repl/target/scala-2.11/classes...
> [ERROR] 
> /Users/jianshuang/workspace/others/spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkIMain.scala:1132:
>  value withIncompleteHandler is not a member of 
> SparkIMain.this.global.PerRunReporting
> [ERROR]   currentRun.reporting.withIncompleteHandler((_, _) => 
> isIncomplete = true) {
> [ERROR]^
> Looks like PerRunParsing has been changed from Reporting to Parsing in 2.11.5
> http://fossies.org/diffs/scala-sources/2.11.2_vs_2.11.5/src/compiler/scala/tools/nsc/Reporting.scala-diff.html
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-12391) JDBC OR operator push down


 [ 
https://issues.apache.org/jira/browse/SPARK-12391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12391.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro  (was: Apache Spark)
Fix Version/s: 2.0.0

> JDBC OR operator push down
> --
>
> Key: SPARK-12391
> URL: https://issues.apache.org/jira/browse/SPARK-12391
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.0.0
>
>
> For SQL OR operator such as
> SELECT *
> FROM table_name
> WHERE column_name1  =  value1 OR  column_name2  = value2
> Will push down to JDBC datasource



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-12409) JDBC AND operator push down


 [ 
https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12409.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.0.0

> JDBC AND operator push down 
> 
>
> Key: SPARK-12409
> URL: https://issues.apache.org/jira/browse/SPARK-12409
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.0.0
>
>
> For simple AND such as 
> select * from test where THEID = 1 AND NAME = 'fred', 
> The filters pushed down to JDBC layers are EqualTo(THEID,1), 
> EqualTo(Name,fred). These are handled OK by the current code. 
> For query such as 
> SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2" ,
> the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2)))
> So need to add And filter in JDBC layer.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-12387) JDBC IN operator push down


 [ 
https://issues.apache.org/jira/browse/SPARK-12387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12387.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro  (was: Apache Spark)
Fix Version/s: 2.0.0

> JDBC  IN operator push down
> ---
>
> Key: SPARK-12387
> URL: https://issues.apache.org/jira/browse/SPARK-12387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Huaxin Gao
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.0.0
>
>
> For SQL IN operator such as
> SELECT column_name(s)
> FROM table_name
> WHERE column_name IN (value1,value2,...)
> Currently this is not pushed down for JDBC datasource.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data


 [ 
https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5763.
---
Resolution: Won't Fix

Resolving as "Won't fix" for now, given discussion on the PR RE: this 
functionality being provided as part of Spark SQL / DataFrames.

> Sort-based Groupby and Join to resolve skewed data
> --
>
> Key: SPARK-5763
> URL: https://issues.apache.org/jira/browse/SPARK-5763
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Lianhui Wang
>
> In SPARK-4644, it provide a way to resolve skewed data. But when we has more 
> keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So 
> we can use sort-merge to resolve skewed-groupby and skewed-join.because 
> SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based 
> on SPARK-2926. And i have implemented sort-merge-groupby and it is very well 
> for skewed data in my test.Later i will implement sort-merge-join to resolve 
> skewed-join.
> [~rxin] [~sandyr] [~andrewor14] how about your opinions about this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition


[ 
https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075448#comment-15075448
 ] 

Josh Rosen commented on SPARK-5581:
---

Update: although the code in question has moved and the comment has changed, I 
believe that this issue is still relevant in some cases.

> When writing sorted map output file, avoid open / close between each partition
> --
>
> Key: SPARK-5581
> URL: https://issues.apache.org/jira/browse/SPARK-5581
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>
> {code}
>   // Bypassing merge-sort; get an iterator by partition and just write 
> everything directly.
>   for ((id, elements) <- this.partitionedIterator) {
> if (elements.hasNext) {
>   val writer = blockManager.getDiskWriter(
> blockId, outputFile, ser, fileBufferSize, 
> context.taskMetrics.shuffleWriteMetrics.get)
>   for (elem <- elements) {
> writer.write(elem)
>   }
>   writer.commitAndClose()
>   val segment = writer.fileSegment()
>   lengths(id) = segment.length
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-5659) Flaky test: o.a.s.streaming.ReceiverSuite.block