[jira] [Commented] (SPARK-14264) Add feature importances for GBTs in Pyspark
[ https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217386#comment-15217386 ] Apache Spark commented on SPARK-14264: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/12056 > Add feature importances for GBTs in Pyspark > --- > > Key: SPARK-14264 > URL: https://issues.apache.org/jira/browse/SPARK-14264 > Project: Spark > Issue Type: New Feature >Reporter: Seth Hendrickson >Priority: Minor > > GBT feature importances are now implemented in scala. We should expose them > in the pyspark API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
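For context on what the PR exposes: tree-ensemble feature importances of this kind are typically computed by summing each feature's impurity gain over a tree's splits, normalizing per tree, then averaging across the ensemble and renormalizing. A minimal plain-Python sketch of that aggregation (hypothetical helper names and data layout; not Spark's actual implementation):

```python
def tree_importances(splits, num_features):
    """splits: list of (feature_index, impurity_gain, node_weight) for one tree."""
    imp = [0.0] * num_features
    for feat, gain, weight in splits:
        imp[feat] += gain * weight
    total = sum(imp)
    # Normalize so each tree contributes equally to the ensemble average.
    return [v / total if total > 0 else 0.0 for v in imp]

def ensemble_importances(trees, num_features):
    """Average the per-tree normalized importances, then renormalize."""
    agg = [0.0] * num_features
    for splits in trees:
        for i, v in enumerate(tree_importances(splits, num_features)):
            agg[i] += v
    total = sum(agg)
    return [v / total for v in agg]
```

In PySpark itself this surfaces as a property on the fitted model mirroring the Scala API, rather than a standalone function like the sketch above.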
[jira] [Assigned] (SPARK-14264) Add feature importances for GBTs in Pyspark
[ https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14264: Assignee: Apache Spark > Add feature importances for GBTs in Pyspark > --- > > Key: SPARK-14264 > URL: https://issues.apache.org/jira/browse/SPARK-14264 > Project: Spark > Issue Type: New Feature >Reporter: Seth Hendrickson >Assignee: Apache Spark >Priority: Minor > > GBT feature importances are now implemented in scala. We should expose them > in the pyspark API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14264) Add feature importances for GBTs in Pyspark
[ https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14264: Assignee: (was: Apache Spark) > Add feature importances for GBTs in Pyspark > --- > > Key: SPARK-14264 > URL: https://issues.apache.org/jira/browse/SPARK-14264 > Project: Spark > Issue Type: New Feature >Reporter: Seth Hendrickson >Priority: Minor > > GBT feature importances are now implemented in scala. We should expose them > in the pyspark API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14264) Add feature importances for GBTs in Pyspark
Seth Hendrickson created SPARK-14264: Summary: Add feature importances for GBTs in Pyspark Key: SPARK-14264 URL: https://issues.apache.org/jira/browse/SPARK-14264 Project: Spark Issue Type: New Feature Reporter: Seth Hendrickson Priority: Minor GBT feature importances are now implemented in scala. We should expose them in the pyspark API.
[jira] [Assigned] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates
[ https://issues.apache.org/jira/browse/SPARK-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14263: Assignee: Apache Spark > Benchmark Vectorized HashMap for GroupBy Aggregates > --- > > Key: SPARK-14263 > URL: https://issues.apache.org/jira/browse/SPARK-14263 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sameer Agarwal >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates
[ https://issues.apache.org/jira/browse/SPARK-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14263: Assignee: (was: Apache Spark) > Benchmark Vectorized HashMap for GroupBy Aggregates > --- > > Key: SPARK-14263 > URL: https://issues.apache.org/jira/browse/SPARK-14263 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates
[ https://issues.apache.org/jira/browse/SPARK-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217375#comment-15217375 ] Apache Spark commented on SPARK-14263: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/12055 > Benchmark Vectorized HashMap for GroupBy Aggregates > --- > > Key: SPARK-14263 > URL: https://issues.apache.org/jira/browse/SPARK-14263 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates
Sameer Agarwal created SPARK-14263: -- Summary: Benchmark Vectorized HashMap for GroupBy Aggregates Key: SPARK-14263 URL: https://issues.apache.org/jira/browse/SPARK-14263 Project: Spark Issue Type: New Feature Components: SQL Reporter: Sameer Agarwal
[jira] [Commented] (SPARK-13112) CoarsedExecutorBackend register to driver should wait Executor was ready
[ https://issues.apache.org/jira/browse/SPARK-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217362#comment-15217362 ] meiyoula commented on SPARK-13112: -- I agree with it. The problem occurs when CoarseGrainedExecutorBackend receives RegisterExecutorResponse only after LaunchTask. I don't think we can guarantee that CoarseGrainedExecutorBackend receives RegisterExecutorResponse before LaunchTask: the backend may be busy while sending RegisterExecutorResponse to itself, and so process the LaunchTask message first.
> CoarsedExecutorBackend register to driver should wait Executor was ready
> 
> Key: SPARK-13112
> URL: https://issues.apache.org/jira/browse/SPARK-13112
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: SuYan
>
> desc:
> Because some hosts' disks are busy, the executor fails with a TimeoutException while trying to register with the shuffle server on that host, and then exits(1) when a task is launched on a null executor.
> Since the Yarn cluster's resources are a little busy, Yarn thinks that host is idle and prefers to allocate an executor on the same host again, so there is a chance that one task fails 4 times on the same host.
> Currently, CoarsedExecutorBackend registers with the driver first, and only initializes the Executor after registration succeeds.
> If an exception occurs during Executor initialization, the driver does not know about it and will still launch tasks on that executor, which then calls System.exit(1).
> {code}
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor(hostname) =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
>   ..
>   case LaunchTask(data) =>
>     if (executor == null) {
>       logError("Received LaunchTask command but executor was null")
>       System.exit(1)
> {code}
> It is more reasonable to register with the driver after the Executor is ready, and to make registerTimeout configurable...
[jira] [Commented] (SPARK-13060) CoarsedExecutorBackend register to driver should wait Executor was ready?
[ https://issues.apache.org/jira/browse/SPARK-13060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217359#comment-15217359 ] meiyoula commented on SPARK-13060: -- This problem occurs when CoarseGrainedExecutorBackend receives RegisterExecutorResponse only after LaunchTask. I met it in an actual Spark cluster.
> CoarsedExecutorBackend register to driver should wait Executor was ready?
> -
> 
> Key: SPARK-13060
> URL: https://issues.apache.org/jira/browse/SPARK-13060
> Project: Spark
> Issue Type: Bug
> Reporter: SuYan
> Priority: Trivial
>
> Hi Josh
> I am a Spark user, and I am currently confused about executor registration.
> Why does CoarsedExecutorBackend register with the driver first, and only initialize the Executor after registration succeeds?
> {code}
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor(hostname) =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
> {code}
> It is also strange that it exits(1) when launchTask finds executor == null (due to errors during Executor initialization).
> {code}
> case LaunchTask(data) =>
>   if (executor == null) {
>     logError("Received LaunchTask command but executor was null")
>     System.exit(1)
> {code}
> What concerns led to registering with the driver first? Why not register with the driver only once the Executor is fully ready?
> like
> {code}
> val (hostname, _) = Utils.extractHostPortFromSparkUrl(hostPort)
> val executor: Executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
> {code}
[jira] [Issue Comment Deleted] (SPARK-13060) CoarsedExecutorBackend register to driver should wait Executor was ready?
[ https://issues.apache.org/jira/browse/SPARK-13060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-13060: - Comment: was deleted (was: When CoarseGrainedExecutorBackend receives RegisterExecutorResponse slow after LaunchTask, it will occurs this problem. I met it in an actual spark cluster. ) > CoarsedExecutorBackend register to driver should wait Executor was ready? > - > > Key: SPARK-13060 > URL: https://issues.apache.org/jira/browse/SPARK-13060 > Project: Spark > Issue Type: Bug >Reporter: SuYan >Priority: Trivial > > Hi Josh > I am spark user, currently I feel confused about executor registration. > {code} > Why CoarsedExecutorBackend register to driver first, and after registerDriver > successful, then initial Executor. > override def receive: PartialFunction[Any, Unit] = { > case RegisteredExecutor(hostname) => > logInfo("Successfully registered with driver") > executor = new Executor(executorId, hostname, env, userClassPath, isLocal > = false) > > {code} > and it is strange that exit(1) while launchTask with executor=null(due to > some errors occurs in Executor initialization). > {code} > case LaunchTask(data) => > if (executor == null) { > logError("Received LaunchTask command but executor was null") > System.exit(1) > {code} > What concerns was considered to make registerDriver first, Why not register > to driver until Executor sth. have been all ready? > like > {code} > val (hostname, _) = Utils.extractHostPortFromSparkUrl(hostPort) > val executor: Executor = new Executor(executorId, hostname, env, > userClassPath, isLocal = false) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13060) CoarsedExecutorBackend register to driver should wait Executor was ready?
[ https://issues.apache.org/jira/browse/SPARK-13060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217360#comment-15217360 ] meiyoula commented on SPARK-13060: -- This problem occurs when CoarseGrainedExecutorBackend receives RegisterExecutorResponse only after LaunchTask. I met it in an actual Spark cluster.
> CoarsedExecutorBackend register to driver should wait Executor was ready?
> -
> 
> Key: SPARK-13060
> URL: https://issues.apache.org/jira/browse/SPARK-13060
> Project: Spark
> Issue Type: Bug
> Reporter: SuYan
> Priority: Trivial
>
> Hi Josh
> I am a Spark user, and I am currently confused about executor registration.
> Why does CoarsedExecutorBackend register with the driver first, and only initialize the Executor after registration succeeds?
> {code}
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor(hostname) =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
> {code}
> It is also strange that it exits(1) when launchTask finds executor == null (due to errors during Executor initialization).
> {code}
> case LaunchTask(data) =>
>   if (executor == null) {
>     logError("Received LaunchTask command but executor was null")
>     System.exit(1)
> {code}
> What concerns led to registering with the driver first? Why not register with the driver only once the Executor is fully ready?
> like
> {code}
> val (hostname, _) = Utils.extractHostPortFromSparkUrl(hostPort)
> val executor: Executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
> {code}
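The message-ordering race discussed in these two issues can be modeled in a few lines. This is a toy plain-Python sketch (not Spark code): the backend only creates its Executor when the RegisteredExecutor reply is processed, so a LaunchTask that arrives first hits a null executor, while constructing the Executor before registering removes that window.

```python
class Backend:
    """Toy model of CoarseGrainedExecutorBackend's message handling."""

    def __init__(self, eager_init=False):
        # Proposed ordering: initialize the Executor before registering,
        # so it can never be observed as None by a task launch.
        self.executor = "executor" if eager_init else None
        self.exited = False

    def on_registered_executor(self):
        # Current ordering: the Executor is only created when the
        # RegisteredExecutor reply is processed.
        self.executor = "executor"

    def on_launch_task(self):
        if self.executor is None:
            self.exited = True  # stands in for System.exit(1)
```

With the current ordering, `Backend().on_launch_task()` before `on_registered_executor()` trips the fatal exit; with `eager_init=True` the same message order is harmless.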
[jira] [Resolved] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-14254. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.0.0
> Add logs to help investigate the network performance
> 
> Key: SPARK-14254
> URL: https://issues.apache.org/jira/browse/SPARK-14254
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
> It would be very helpful for network performance investigation if we logged the time spent on connecting and resolving hosts.
[jira] [Assigned] (SPARK-14262) Correct app state after master leader changed
[ https://issues.apache.org/jira/browse/SPARK-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14262: Assignee: Apache Spark
> Correct app state after master leader changed
> -
> 
> Key: SPARK-14262
> URL: https://issues.apache.org/jira/browse/SPARK-14262
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2, 1.6.1
> Reporter: ZhengYaofeng
> Assignee: Apache Spark
> Priority: Minor
>
> Suppose Spark is deployed in a cluster with recovery mode enabled via ZooKeeper.
> 1) Start the Spark cluster and submit several apps.
> 2) After these apps' states change to RUNNING, shut down one master.
> 3) Another master becomes leader and recovers these apps; their states become WAITING instead of RUNNING, although they are really at work.
[jira] [Assigned] (SPARK-14262) Correct app state after master leader changed
[ https://issues.apache.org/jira/browse/SPARK-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14262: Assignee: (was: Apache Spark)
> Correct app state after master leader changed
> -
> 
> Key: SPARK-14262
> URL: https://issues.apache.org/jira/browse/SPARK-14262
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2, 1.6.1
> Reporter: ZhengYaofeng
> Priority: Minor
>
> Suppose Spark is deployed in a cluster with recovery mode enabled via ZooKeeper.
> 1) Start the Spark cluster and submit several apps.
> 2) After these apps' states change to RUNNING, shut down one master.
> 3) Another master becomes leader and recovers these apps; their states become WAITING instead of RUNNING, although they are really at work.
[jira] [Commented] (SPARK-14262) Correct app state after master leader changed
[ https://issues.apache.org/jira/browse/SPARK-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217348#comment-15217348 ] Apache Spark commented on SPARK-14262: -- User 'GavinGavinNo1' has created a pull request for this issue: https://github.com/apache/spark/pull/12054
> Correct app state after master leader changed
> -
> 
> Key: SPARK-14262
> URL: https://issues.apache.org/jira/browse/SPARK-14262
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2, 1.6.1
> Reporter: ZhengYaofeng
> Priority: Minor
>
> Suppose Spark is deployed in a cluster with recovery mode enabled via ZooKeeper.
> 1) Start the Spark cluster and submit several apps.
> 2) After these apps' states change to RUNNING, shut down one master.
> 3) Another master becomes leader and recovers these apps; their states become WAITING instead of RUNNING, although they are really at work.
[jira] [Created] (SPARK-14262) Correct app state after master leader changed
ZhengYaofeng created SPARK-14262: Summary: Correct app state after master leader changed Key: SPARK-14262 URL: https://issues.apache.org/jira/browse/SPARK-14262 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1, 1.5.2 Reporter: ZhengYaofeng Priority: Minor Suppose Spark is deployed in a cluster with recovery mode enabled via ZooKeeper. 1) Start the Spark cluster and submit several apps. 2) After these apps' states change to RUNNING, shut down one master. 3) Another master becomes leader and recovers these apps; their states become WAITING instead of RUNNING, although they are really at work.
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217324#comment-15217324 ] Stefan Krawczyk commented on SPARK-14033: - {quote}It actually makes it significantly easier to maintain because it eliminates a lot of duplicated code. This duplicated functionality between Estimator & Model (in setters, schema validation, etc.) leads to inconsistencies and bugs; I actually found bugs in StringIndexer and RFormula while prototyping the merge of StringIndexer & Model.{quote} That smells like a different abstraction issue to me. But sure, I can see where you're coming from. One argument for keeping them separate is that, from a dependency standpoint, it would be advantageous to separate training code (estimators) from prediction code (models). That way you could package your trained models and use them without having to bring in all the training dependencies.
> Merging Estimator & Model
> -
> 
> Key: SPARK-14033
> URL: https://issues.apache.org/jira/browse/SPARK-14033
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Attachments: StyleMutabilityMergingEstimatorandModel.pdf
>
> This JIRA is for merging the spark.ml concepts of Estimator and Model.
> Goal: Have clearer semantics which match existing libraries (such as scikit-learn).
> For details, please see the linked design doc. Comment on this JIRA to give feedback on the proposed design. Once the proposal is discussed and this work is confirmed as ready to proceed, this JIRA will serve as an umbrella for the merge tasks.
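The dependency argument can be made concrete with a small sketch (plain Python with hypothetical class names, mimicking the estimator/model split that spark.ml currently has): the estimator carries the training logic, while the model object it returns holds only the state needed at prediction time and could in principle be shipped without the training code.

```python
class MeanImputer:
    """Estimator: owns training-time logic (and would own its dependencies)."""

    def fit(self, column):
        # "Training" here is just computing the column mean.
        mean = sum(column) / len(column)
        return MeanImputerModel(mean)


class MeanImputerModel:
    """Model: carries only prediction-time state, no training code."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, column):
        # Replace missing values with the learned mean.
        return [self.mean if v is None else v for v in column]
```

Merging the two classes would put `fit` and `transform` on one object; keeping them separate lets the model travel without the estimator.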
[jira] [Closed] (SPARK-13703) Remove obsolete scala-2.10 source files
[ https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luciano Resende closed SPARK-13703. --- Resolution: Not A Problem > Remove obsolete scala-2.10 source files > --- > > Key: SPARK-13703 > URL: https://issues.apache.org/jira/browse/SPARK-13703 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende >Priority: Minor >
[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochun Liang updated SPARK-14261: --- Attachment: MemorySnapshot.PNG This is the memory snapshot after queries have been running for over 11 hours.
> Memory leak in Spark Thrift Server
> --
> 
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Xiaochun Liang
> Attachments: MemorySnapshot.PNG
>
> I am running the Spark Thrift server on Windows Server 2012, launched in Yarn client mode. Its memory usage increases gradually as queries come in. I suspect there is a memory leak in the Spark Thrift server.
[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochun Liang updated SPARK-14261: --- Description: I am running the Spark Thrift server on Windows Server 2012, launched in Yarn client mode. Its memory usage increases gradually as queries come in. I suspect there is a memory leak in the Spark Thrift server. (was: I am running Spark Thrift server on Windows Server 2012. The Spark Thrift server is launched as Yarn client mode. Its memory usage is increased gradually with the queries in. )
> Memory leak in Spark Thrift Server
> --
> 
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Xiaochun Liang
>
> I am running the Spark Thrift server on Windows Server 2012, launched in Yarn client mode. Its memory usage increases gradually as queries come in. I suspect there is a memory leak in the Spark Thrift server.
[jira] [Created] (SPARK-14261) Memory leak in Spark Thrift Server
Xiaochun Liang created SPARK-14261: -- Summary: Memory leak in Spark Thrift Server Key: SPARK-14261 URL: https://issues.apache.org/jira/browse/SPARK-14261 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Xiaochun Liang I am running the Spark Thrift server on Windows Server 2012, launched in Yarn client mode. Its memory usage increases gradually as queries come in.
[jira] [Commented] (SPARK-14260) Increase default value for maxCharsPerColumn
[ https://issues.apache.org/jira/browse/SPARK-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217262#comment-15217262 ] Hyukjin Kwon commented on SPARK-14260: -- I am currently not sure whether this value affects performance. I should investigate, and will close this issue if it does matter.
> Increase default value for maxCharsPerColumn
> 
> Key: SPARK-14260
> URL: https://issues.apache.org/jira/browse/SPARK-14260
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Hyukjin Kwon
> Priority: Trivial
>
> I guess the default value of the option {{maxCharsPerColumn}} is relatively small: 1,000,000 characters, meaning 976KB.
> It looks like some users have hit this and ended up setting the value manually:
> https://github.com/databricks/spark-csv/issues/295
> https://issues.apache.org/jira/browse/SPARK-14103
> According to the [univocity API|http://docs.univocity.com/parsers/2.0.0/com/univocity/parsers/common/CommonSettings.html#setMaxCharsPerColumn(int)], this limit exists to avoid {{OutOfMemoryErrors}}.
> If this does not harm performance, I think it would be better to make the default value much bigger (e.g. 10MB or 100MB) so that users do not have to worry about the length of each field in a CSV file.
> Apparently the Apache CSV parser does not have such limits.
[jira] [Created] (SPARK-14260) Increase default value for maxCharsPerColumn
Hyukjin Kwon created SPARK-14260: Summary: Increase default value for maxCharsPerColumn Key: SPARK-14260 URL: https://issues.apache.org/jira/browse/SPARK-14260 Project: Spark Issue Type: Sub-task Reporter: Hyukjin Kwon Priority: Trivial I guess the default value of the option {{maxCharsPerColumn}} is relatively small: 1,000,000 characters, meaning 976KB. It looks like some users have hit this and ended up setting the value manually: https://github.com/databricks/spark-csv/issues/295 https://issues.apache.org/jira/browse/SPARK-14103 According to the [univocity API|http://docs.univocity.com/parsers/2.0.0/com/univocity/parsers/common/CommonSettings.html#setMaxCharsPerColumn(int)], this limit exists to avoid {{OutOfMemoryErrors}}. If this does not harm performance, I think it would be better to make the default value much bigger (e.g. 10MB or 100MB) so that users do not have to worry about the length of each field in a CSV file. Apparently the Apache CSV parser does not have such limits.
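To make the trade-off concrete, here is a hedged plain-Python sketch (an illustration, not the univocity parser itself) of the kind of per-field character cap {{maxCharsPerColumn}} imposes: the cap bounds how much memory one malformed field, e.g. one produced by a wrong delimiter or line separator, can consume before the parser gives up.

```python
def parse_field(chars, max_chars_per_column=1_000_000):
    """Accumulate one field's characters, refusing to grow past the cap."""
    buf = []
    for ch in chars:
        if len(buf) >= max_chars_per_column:
            # Mirrors the intent of univocity's error: an oversized field
            # usually means the delimiter or line separator is wrong.
            raise ValueError(
                "Length of parsed input exceeds maxCharsPerColumn "
                f"({max_chars_per_column}); check the delimiter/line separator")
        buf.append(ch)
    return "".join(buf)
```

Raising the default cap trades this early-failure protection for convenience on files with genuinely long fields.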
[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217235#comment-15217235 ] zhengruifeng commented on SPARK-14174: -- Faster KMeans is needed in many cases: high dimensionality (> 1,000,000), big K (> 1000), and large datasets (unlabeled data is much easier to obtain than labeled data, and the data size is usually much larger than in supervised learning tasks). In practice, I encountered a time-consuming case: I needed to cluster about one billion points with 300,000 dimensions into 1000 centers. Using MLlib's KMeans on a cluster of 16 servers, it took about one week to finish. I will compare it with BisectingKMeans on speed and cost.
> Accelerate KMeans via Mini-Batch EM
> ---
> 
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: zhengruifeng
> Priority: Minor
>
> MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
> I have implemented mini-batch KMeans in MLlib, and the acceleration is really significant.
> MiniBatch KMeans is named XMeans in the following lines.
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> All three trainings above reached the maximum number of iterations (10).
> We can see that the WSSSEs are almost the same, while their speed differs significantly:
> KMeans: 2876 sec
> MiniBatch KMeans (fraction=0.1): 263 sec
> MiniBatch KMeans (fraction=0.01): 90 sec
> With an appropriate fraction, the bigger the dataset, the higher the speedup.
> The data used above has 8,100,000 samples and 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
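For readers unfamiliar with the technique, the per-center running-mean update at the heart of mini-batch k-means fits in a few lines of NumPy. This is an illustrative single-machine sketch with a naive "first k points" initialization, not the proposed XMeans implementation (which would use k-means|| initialization and Spark RDDs):

```python
import numpy as np

def mini_batch_kmeans(X, k, fraction=0.1, iters=10, seed=123):
    """Update centers from a random sample each iteration instead of the full data."""
    rng = np.random.default_rng(seed)
    centers = X[:k].astype(float)  # naive init for the sketch
    counts = np.zeros(k)
    batch = max(1, int(fraction * len(X)))
    for _ in range(iters):
        sample = X[rng.choice(len(X), batch, replace=False)]
        # Assign each sampled point to its nearest center.
        dists = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for i, x in zip(labels, sample):
            counts[i] += 1
            # Per-center step size 1/count makes each center the running
            # mean of all points ever assigned to it.
            centers[i] += (x - centers[i]) / counts[i]
    return centers
```

The `fraction` parameter plays the role of `miniBatchFraction` in the quoted benchmark: each iteration touches only `fraction * n` points, which is where the speedup comes from.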
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217212#comment-15217212 ] Hyukjin Kwon commented on SPARK-14103: -- For long messages, there is a JIRA opened already here, SPARK-13792. > Python DataFrame CSV load on large file is writing to console in Ipython > > > Key: SPARK-14103 > URL: https://issues.apache.org/jira/browse/SPARK-14103 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master > branch >Reporter: Shubhanshu Mishra > Labels: csv, csvparser, dataframe, pyspark > > I am using the spark from the master branch and when I run the following > command on a large tab separated file then I get the contents of the file > being written to the stderr > {code} > df = sqlContext.read.load("temp.txt", format="csv", header="false", > inferSchema="true", delimiter="\t") > {code} > Here is a sample of output: > {code} > ^M[Stage 1:> (0 + 2) > / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 2) > com.univocity.parsers.common.TextParsingException: Error processing input: > Length of parsed input (101) exceeds the maximum number of characters > defined in your parser settings (100). Identified line separator > characters in the parsed content. This may be the cause of the error. The > line separator in your parser settings is set to '\n'. Parsed content: > Privacy-shake",: a haptic interface for managing privacy settings in > mobile location sharing applications privacy shake a haptic interface > for managing privacy settings in mobile location sharing applications 2010 > 2010/09/07 international conference on human computer > interaction interact4333105819371[\n] > 3D4F6CA1Between the Profiles: Another such Bias. 
Technology > Acceptance Studies on Social Network Services between the profiles > another such bias technology acceptance studies on social network services > 20152015/08/02 10.1007/978-3-319-21383-5_12international > conference on human-computer interaction interact43331058 > 19502[\n] > ... > . > web snippets20082008/05/04 10.1007/978-3-642-01344-7_13 > international conference on web information systems and technologies > webist 44F2980219489 > 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration > interactive 3d user interfaces for neuroanatomy exploration 2009 > internationa] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:82) > at > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException > 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; > aborting job > ^M[Stage 1:> (0 + 1) > / 2] > {code} > For a
[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217198#comment-15217198 ] Hyukjin Kwon edited comment on SPARK-14103 at 3/30/16 1:30 AM: --- As [~sowen] said, CRLF is dealt with in {{TextInputFormat}} which calls [LineReader.readDefaultLine(...)|https://github.com/apache/hadoop/blob/7fd00b3db4b7d73afd41276ba9a06ec06a0e1762/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L186] at the end, meaning it would not be a problem. [~shubhanshumis...@gmail.com] So, the codes below: {code} df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error {code} throw an exception, right? Could you maybe share the error message? I (or my blind eyes) cannot find the exception message about this. was (Author: hyukjin.kwon): As [~sowen] said, CRLF is dealt with in {{TextInputFormat}} which calls [LineReader.readDefaultLine(...)|https://github.com/apache/hadoop/blob/7fd00b3db4b7d73afd41276ba9a06ec06a0e1762/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L186] at the end, meaning it would not be a problem. [~shubhanshumis...@gmail.com] So, the codes below: {code} df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error {code} throws an exception, right? Could you maybe share the error message? I (or may blind eyes) cannot find the exception message about this. 
> Python DataFrame CSV load on large file is writing to console in Ipython > > > Key: SPARK-14103 > URL: https://issues.apache.org/jira/browse/SPARK-14103 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master > branch >Reporter: Shubhanshu Mishra > Labels: csv, csvparser, dataframe, pyspark > > I am using the spark from the master branch and when I run the following > command on a large tab separated file then I get the contents of the file > being written to the stderr > {code} > df = sqlContext.read.load("temp.txt", format="csv", header="false", > inferSchema="true", delimiter="\t") > {code} > Here is a sample of output: > {code} > ^M[Stage 1:> (0 + 2) > / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 2) > com.univocity.parsers.common.TextParsingException: Error processing input: > Length of parsed input (101) exceeds the maximum number of characters > defined in your parser settings (100). Identified line separator > characters in the parsed content. This may be the cause of the error. The > line separator in your parser settings is set to '\n'. Parsed content: > Privacy-shake",: a haptic interface for managing privacy settings in > mobile location sharing applications privacy shake a haptic interface > for managing privacy settings in mobile location sharing applications 2010 > 2010/09/07 international conference on human computer > interaction interact4333105819371[\n] > 3D4F6CA1Between the Profiles: Another such Bias. Technology > Acceptance Studies on Social Network Services between the profiles > another such bias technology acceptance studies on social network services > 20152015/08/02 10.1007/978-3-319-21383-5_12international > conference on human-computer interaction interact43331058 > 19502[\n] > ... > . 
> web snippets20082008/05/04 10.1007/978-3-642-01344-7_13 > international conference on web information systems and technologies > webist 44F2980219489 > 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration > interactive 3d user interfaces for neuroanatomy exploration 2009 > internationa] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155) > at >
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217198#comment-15217198 ] Hyukjin Kwon commented on SPARK-14103: -- As [~sowen] said, CRLF is dealt with in {{TextInputFormat}} which calls [LineReader.readDefaultLine(...)|https://github.com/apache/hadoop/blob/7fd00b3db4b7d73afd41276ba9a06ec06a0e1762/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L186] at the end, meaning it would not be a problem. [~shubhanshumis...@gmail.com] So, the code below: {code} df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error {code} throws an exception, right? Could you maybe share the error message? I (or my blind eyes) cannot find the exception message about this. > Python DataFrame CSV load on large file is writing to console in Ipython > > > Key: SPARK-14103 > URL: https://issues.apache.org/jira/browse/SPARK-14103 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master > branch >Reporter: Shubhanshu Mishra > Labels: csv, csvparser, dataframe, pyspark > > I am using the spark from the master branch and when I run the following > command on a large tab separated file then I get the contents of the file > being written to the stderr > {code} > df = sqlContext.read.load("temp.txt", format="csv", header="false", > inferSchema="true", delimiter="\t") > {code} > Here is a sample of output: > {code} > ^M[Stage 1:> (0 + 2) > / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 2) > com.univocity.parsers.common.TextParsingException: Error processing input: > Length of parsed input (101) exceeds the maximum number of characters > defined in your parser settings (100). Identified line separator > characters in the parsed content. This may be the cause of the error. 
The > line separator in your parser settings is set to '\n'. Parsed content: > Privacy-shake",: a haptic interface for managing privacy settings in > mobile location sharing applications privacy shake a haptic interface > for managing privacy settings in mobile location sharing applications 2010 > 2010/09/07 international conference on human computer > interaction interact4333105819371[\n] > 3D4F6CA1Between the Profiles: Another such Bias. Technology > Acceptance Studies on Social Network Services between the profiles > another such bias technology acceptance studies on social network services > 20152015/08/02 10.1007/978-3-319-21383-5_12international > conference on human-computer interaction interact43331058 > 19502[\n] > ... > . > web snippets20082008/05/04 10.1007/978-3-642-01344-7_13 > international conference on web information systems and technologies > webist 44F2980219489 > 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration > interactive 3d user interfaces for neuroanatomy exploration 2009 > internationa] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120) > at > 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at
[jira] [Issue Comment Deleted] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-14103: - Comment: was deleted (was: As [~sowen] said, CRLF is dealt with in {{TextInputFormat}} which calls [LineReader.readDefaultLine(...)|https://github.com/apache/hadoop/blob/7fd00b3db4b7d73afd41276ba9a06ec06a0e1762/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L186] at the end, meaning it would not be a problem. [~shubhanshumis...@gmail.com] So, the codes below: {code} df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error {code} throws an exception, right? Could you maybe share the error message? I (or may blind eyes) cannot find the exception message about this.) > Python DataFrame CSV load on large file is writing to console in Ipython > > > Key: SPARK-14103 > URL: https://issues.apache.org/jira/browse/SPARK-14103 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master > branch >Reporter: Shubhanshu Mishra > Labels: csv, csvparser, dataframe, pyspark > > I am using the spark from the master branch and when I run the following > command on a large tab separated file then I get the contents of the file > being written to the stderr > {code} > df = sqlContext.read.load("temp.txt", format="csv", header="false", > inferSchema="true", delimiter="\t") > {code} > Here is a sample of output: > {code} > ^M[Stage 1:> (0 + 2) > / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 2) > com.univocity.parsers.common.TextParsingException: Error processing input: > Length of parsed input (101) exceeds the maximum number of characters > defined in your parser settings (100). Identified line separator > characters in the parsed content. This may be the cause of the error. 
The > line separator in your parser settings is set to '\n'. Parsed content: > Privacy-shake",: a haptic interface for managing privacy settings in > mobile location sharing applications privacy shake a haptic interface > for managing privacy settings in mobile location sharing applications 2010 > 2010/09/07 international conference on human computer > interaction interact4333105819371[\n] > 3D4F6CA1Between the Profiles: Another such Bias. Technology > Acceptance Studies on Social Network Services between the profiles > another such bias technology acceptance studies on social network services > 20152015/08/02 10.1007/978-3-319-21383-5_12international > conference on human-computer interaction interact43331058 > 19502[\n] > ... > . > web snippets20082008/05/04 10.1007/978-3-642-01344-7_13 > international conference on web information systems and technologies > webist 44F2980219489 > 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration > interactive 3d user interfaces for neuroanatomy exploration 2009 > internationa] > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241) > at > com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120) > at > scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212) > at > org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120) > at > 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at > org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at
[jira] [Commented] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell
[ https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217194#comment-15217194 ] Hyukjin Kwon commented on SPARK-14194: -- Yes. CSV data source internally uses {{TextInputFormat}} and it will recognise each line by CRLF by default. I just wonder if standard CSV format allows newlines within records. > spark csv reader not working properly if CSV content contains CRLF character > (newline) in the intermediate cell > --- > > Key: SPARK-14194 > URL: https://issues.apache.org/jira/browse/SPARK-14194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Kumaresh C R > > We have CSV content like below, > Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r > "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), > Municapality,","USA", "1234567" > Since there is a '\n\r' character in the middle of the row (in the Address > column, to be exact), when we execute the Spark code below, it creates the > dataframe with two rows (excluding the header row), which is wrong. Since we > have specified the quote (") character, why does it treat the character in > the middle as a newline? This creates an issue when processing the resulting > dataframe. > DataFrame df = > sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv") > .option("header", "true") > .option("inferSchema", "true") > .option("delimiter", delim) > .option("quote", quote) > .option("escape", escape) > .load(sourceFile); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
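On the question of whether standard CSV allows newlines within records: RFC 4180 does permit CRLF inside a quoted field, and a quote-aware parser such as Python's stdlib {{csv}} module handles it, which illustrates why a purely line-oriented reader like {{TextInputFormat}} splits such records. A small illustration, independent of Spark:

```python
import csv
import io

# A record whose Address field contains an embedded CRLF,
# as in the report above (simplified to two columns).
raw = 'Sl.NO,Address\r\n"1","XZ Street\r\nMunicipality"\r\n'

# A quote-aware CSV parser keeps the CRLF inside the quoted
# field, so it sees a header row plus exactly one data record.
rows = list(csv.reader(io.StringIO(raw)))

# A naive line-oriented split, by contrast, breaks the record
# in two -- the behaviour described in the issue.
lines = raw.splitlines()
```

Here `rows` has two entries (header and one record) while `lines` has three, which is exactly the mismatch the reporter observed.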
[jira] [Assigned] (SPARK-14258) change scope of some functions in KafkaCluster
[ https://issues.apache.org/jira/browse/SPARK-14258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14258: Assignee: Apache Spark > change scope of some functions in KafkaCluster > -- > > Key: SPARK-14258 > URL: https://issues.apache.org/jira/browse/SPARK-14258 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tao Wang >Assignee: Apache Spark >Priority: Minor > Labels: kafka > > There are some functions in KafkaCluster whose scope is broader than > necessary. I found them while reviewing the code; it would be nice to narrow > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14258) change scope of some functions in KafkaCluster
[ https://issues.apache.org/jira/browse/SPARK-14258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217192#comment-15217192 ] Apache Spark commented on SPARK-14258: -- User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/12053 > change scope of some functions in KafkaCluster > -- > > Key: SPARK-14258 > URL: https://issues.apache.org/jira/browse/SPARK-14258 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tao Wang >Priority: Minor > Labels: kafka > > There are some functions in KafkaCluster whose scope is broader than > necessary. I found them while reviewing the code; it would be nice to narrow > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14258) change scope of some functions in KafkaCluster
[ https://issues.apache.org/jira/browse/SPARK-14258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14258: Assignee: (was: Apache Spark) > change scope of some functions in KafkaCluster > -- > > Key: SPARK-14258 > URL: https://issues.apache.org/jira/browse/SPARK-14258 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tao Wang >Priority: Minor > Labels: kafka > > There are some functions in KafkaCluster whose scope is broader than necessary. > I noticed this while reviewing the code; it would be good to narrow them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14258) change scope of some functions in KafkaCluster
Tao Wang created SPARK-14258: Summary: change scope of some functions in KafkaCluster Key: SPARK-14258 URL: https://issues.apache.org/jira/browse/SPARK-14258 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tao Wang Priority: Minor There are some functions in KafkaCluster whose scope is broader than necessary. I noticed this while reviewing the code; it would be good to narrow them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14259) Add config to control maximum number of files when coalescing partitions
Nong Li created SPARK-14259: --- Summary: Add config to control maximum number of files when coalescing partitions Key: SPARK-14259 URL: https://issues.apache.org/jira/browse/SPARK-14259 Project: Spark Issue Type: Improvement Components: SQL Reporter: Nong Li Priority: Minor The FileSourceStrategy currently has a config to control the maximum byte size of coalesced partitions. It is helpful to also have a config to control the maximum number of files, as even small files have a non-trivial fixed cost. The current packing can put a lot of small files together, which causes straggler tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
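The packing idea in SPARK-14259 can be sketched in a few lines. This is an illustrative model, not Spark's actual FileSourceStrategy code, and the `max_files` cap is the hypothetical config the issue proposes: without it, a byte-size cap alone lets hundreds of tiny files land in one partition, where their fixed per-file open cost creates a straggler.

```python
# Greedy packing of file sizes into coalesced partitions, limited by both a
# total-byte cap and a file-count cap (the latter is the proposed addition).
def pack_files(sizes, max_bytes, max_files):
    partitions, current, current_bytes = [], [], 0
    for size in sizes:
        # start a new partition when either cap would be exceeded
        if current and (current_bytes + size > max_bytes
                        or len(current) >= max_files):
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions

# 100 one-byte files: the byte cap alone packs them all into one partition,
# while a count cap of 10 spreads them over 10 partitions of 10 files each.
print(len(pack_files([1] * 100, max_bytes=1000, max_files=10)))  # 10
```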
[jira] [Assigned] (SPARK-14127) [Table related commands] Describe table
[ https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-14127: - Assignee: Andrew Or > [Table related commands] Describe table > --- > > Key: SPARK-14127 > URL: https://issues.apache.org/jira/browse/SPARK-14127 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Andrew Or > > TOK_DESCTABLE > Describe a column/table/partition (see here and here). Seems we support > DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other > syntaxes (and check if we are missing anything). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13703) Remove obsolete scala-2.10 source files
[ https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13703: Assignee: (was: Apache Spark) > Remove obsolete scala-2.10 source files > --- > > Key: SPARK-13703 > URL: https://issues.apache.org/jira/browse/SPARK-13703 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13703) Remove obsolete scala-2.10 source files
[ https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13703: Assignee: Apache Spark > Remove obsolete scala-2.10 source files > --- > > Key: SPARK-13703 > URL: https://issues.apache.org/jira/browse/SPARK-13703 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13703) Remove obsolete scala-2.10 source files
[ https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luciano Resende reopened SPARK-13703: - Now that it seems we are dropping Scala 2.10 for Spark 2.0, we can clean up some Scala 2.10-specific files. > Remove obsolete scala-2.10 source files > --- > > Key: SPARK-13703 > URL: https://issues.apache.org/jira/browse/SPARK-13703 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Luciano Resende >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217170#comment-15217170 ] Miles Crawford commented on SPARK-14209: I don't think there's anything custom about our logging - we have it set to INFO, and a few specific modules are turned down, but not the ones you mention. > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217150#comment-15217150 ] Marcelo Vanzin commented on SPARK-14209: Are you using a custom log configuration? I can't find some log messages that I can see in the code, especially from BlockManagerMasterEndpoint and BlockManagerMaster. BTW I think the real culprit is executor 51 (container 000191 instead of the one you originally mention). Still isolating all the logs related to that executor to try to figure out what's going on. > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. 
> It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14125) View related commands
[ https://issues.apache.org/jira/browse/SPARK-14125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-14125: -- Comment: was deleted (was: User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12014) > View related commands > - > > Key: SPARK-14125 > URL: https://issues.apache.org/jira/browse/SPARK-14125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > We should support the following commands. > TOK_ALTERVIEW_AS > TOK_ALTERVIEW_PROPERTIES > TOK_ALTERVIEW_RENAME > TOK_DROPVIEW > TOK_DROPVIEW_PROPERTIES > TOK_DESCTABLE > For TOK_ALTERVIEW_ADDPARTS/TOK_ALTERVIEW_DROPPARTS, we should throw > exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14125) View related commands
[ https://issues.apache.org/jira/browse/SPARK-14125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-14125: -- Assignee: Xiao Li > View related commands > - > > Key: SPARK-14125 > URL: https://issues.apache.org/jira/browse/SPARK-14125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li > > We should support the following commands. > TOK_ALTERVIEW_AS > TOK_ALTERVIEW_PROPERTIES > TOK_ALTERVIEW_RENAME > TOK_DROPVIEW > TOK_DROPVIEW_PROPERTIES > TOK_DESCTABLE > For TOK_ALTERVIEW_ADDPARTS/TOK_ALTERVIEW_DROPPARTS, we should throw > exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14124) Database related commands
[ https://issues.apache.org/jira/browse/SPARK-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14124. --- Resolution: Fixed Assignee: Xiao Li Fix Version/s: 2.0.0 > Database related commands > - > > Key: SPARK-14124 > URL: https://issues.apache.org/jira/browse/SPARK-14124 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li > Fix For: 2.0.0 > > > We should support the following commands. > TOK_CREATEDATABASE > TOK_DESCDATABASE > TOK_DROPDATABASE > TOK_ALTERDATABASE_PROPERTIES > For, TOK_ALTERDATABASE_OWNER, let's throw an exception for now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13796) Lock release errors occur frequently in executor logs
[ https://issues.apache.org/jira/browse/SPARK-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217100#comment-15217100 ] Apache Spark commented on SPARK-13796: -- User 'nishkamravi2' has created a pull request for this issue: https://github.com/apache/spark/pull/12052 > Lock release errors occur frequently in executor logs > - > > Key: SPARK-13796 > URL: https://issues.apache.org/jira/browse/SPARK-13796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Nishkam Ravi >Priority: Minor > > Executor logs contain a lot of these error messages (irrespective of the > workload): > 16/03/08 17:53:07 ERROR executor.Executor: 1 block locks were not released by > TID = 1119 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217089#comment-15217089 ] Miles Crawford commented on SPARK-14209: Here you go, from the same application run: https://www.dropbox.com/s/0m5c5rshydhj4am/driver-log.gz?dl=0 > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217085#comment-15217085 ] Miles Crawford commented on SPARK-14209: Here you go, from the same application run: https://www.dropbox.com/s/aw7jw1q43lgvxu9/driver-logs.gz?dl=0 > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Crawford updated SPARK-14209: --- Comment: was deleted (was: Here you go, from the same application run: https://www.dropbox.com/s/aw7jw1q43lgvxu9/driver-logs.gz?dl=0) > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14225) Cap the length of toCommentSafeString at 128 chars
[ https://issues.apache.org/jira/browse/SPARK-14225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14225. - Resolution: Fixed Assignee: Sameer Agarwal (was: Reynold Xin) Fix Version/s: 2.0.0 > Cap the length of toCommentSafeString at 128 chars > -- > > Key: SPARK-14225 > URL: https://issues.apache.org/jira/browse/SPARK-14225 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > > I just saw a query with over 1000 fields and the generated expression comment > is pages long. At that level printing them no longer gives us much benefit. > We should just cap the length at some level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
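The fix in SPARK-14225 amounts to truncating generated-code comments at a fixed length. A minimal Python sketch of the idea (Spark's actual `toCommentSafeString` is Scala, and the exact escaping and truncation marker here are assumptions, not its real implementation):

```python
MAX_COMMENT_LEN = 128  # cap proposed in the issue

def to_comment_safe_string(s, max_len=MAX_COMMENT_LEN):
    # escape "*/" so the text cannot terminate the generated /* ... */ comment
    safe = s.replace("*/", "*\\/")
    # cap the length so a 1000-field expression doesn't emit pages of comments
    return safe if len(safe) <= max_len else safe[:max_len] + "..."

print(len(to_comment_safe_string("x" * 1000)))  # 131: 128 chars plus "..."
print(to_comment_safe_string("short"))  # short strings pass through unchanged
```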
[jira] [Commented] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217063#comment-15217063 ] Apache Spark commented on SPARK-14253: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12051 > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Spark should just handle all temporary functions ourselves instead of passing > it to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14253: Assignee: Andrew Or (was: Apache Spark) > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Spark should just handle all temporary functions ourselves instead of passing > it to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14253: Assignee: Apache Spark (was: Andrew Or) > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark > > Spark should just handle all temporary functions ourselves instead of passing > it to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217055#comment-15217055 ] Marcelo Vanzin commented on SPARK-14209: Are you able to share the logs from your driver? > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12382) Remove spark.mllib GBT implementation and wrap spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217051#comment-15217051 ] Apache Spark commented on SPARK-12382: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/12050 > Remove spark.mllib GBT implementation and wrap spark.ml > --- > > Key: SPARK-12382 > URL: https://issues.apache.org/jira/browse/SPARK-12382 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > After the GBT implementation is moved to spark.ml, we should remove the > implementation from spark.mllib. The MLlib GBTs will then just call the > implementation in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12382) Remove spark.mllib GBT implementation and wrap spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12382: Assignee: (was: Apache Spark) > Remove spark.mllib GBT implementation and wrap spark.ml > --- > > Key: SPARK-12382 > URL: https://issues.apache.org/jira/browse/SPARK-12382 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > After the GBT implementation is moved to spark.ml, we should remove the > implementation from spark.mllib. The MLlib GBTs will then just call the > implementation in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12382) Remove spark.mllib GBT implementation and wrap spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12382: Assignee: Apache Spark > Remove spark.mllib GBT implementation and wrap spark.ml > --- > > Key: SPARK-12382 > URL: https://issues.apache.org/jira/browse/SPARK-12382 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Apache Spark > > After the GBT implementation is moved to spark.ml, we should remove the > implementation from spark.mllib. The MLlib GBTs will then just call the > implementation in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14257: Assignee: Shixiong Zhu (was: Apache Spark) > Allow multiple continuous queries to be started from the same DataFrame > --- > > Key: SPARK-14257 > URL: https://issues.apache.org/jira/browse/SPARK-14257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now DataFrame stores the Source in StreamingRelation, which prevents it from > being reused across multiple continuous queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217049#comment-15217049 ] Apache Spark commented on SPARK-14257: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/12049 > Allow multiple continuous queries to be started from the same DataFrame > --- > > Key: SPARK-14257 > URL: https://issues.apache.org/jira/browse/SPARK-14257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now DataFrame stores the Source in StreamingRelation, which prevents it from > being reused across multiple continuous queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14257: Assignee: Apache Spark (was: Shixiong Zhu) > Allow multiple continuous queries to be started from the same DataFrame > --- > > Key: SPARK-14257 > URL: https://issues.apache.org/jira/browse/SPARK-14257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Right now DataFrame stores the Source in StreamingRelation, which prevents it from > being reused across multiple continuous queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217047#comment-15217047 ] Liang-Chi Hsieh commented on SPARK-14253: - In fact, the current HiveFunctionRegistry doesn't handle creating temporary functions; it only handles looking them up and wrapping them in Hive UDF classes. Currently, the DDL commands to create and drop temporary functions are passed directly to Hive as native commands. > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Spark should just handle all temporary functions ourselves instead of passing > it to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
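The change discussed in SPARK-14253 is essentially to keep temporary functions in Spark's own session-local registry rather than round-tripping CREATE/DROP TEMPORARY FUNCTION through Hive as native commands. A minimal sketch of such a registry (illustrative Python, not Spark's Scala FunctionRegistry; names are hypothetical):

```python
# Session-local registry: create, drop, and lookup never touch an external
# metastore, so "temporary" has one clear meaning and no extra Hive call.
class FunctionRegistry:
    def __init__(self):
        self._temp = {}

    def create_temp_function(self, name, fn):
        self._temp[name.lower()] = fn  # function names are case-insensitive

    def drop_temp_function(self, name):
        self._temp.pop(name.lower(), None)

    def lookup(self, name):
        return self._temp.get(name.lower())

reg = FunctionRegistry()
reg.create_temp_function("strlen", len)
print(reg.lookup("STRLEN")("spark"))  # 5: resolved locally, case-insensitively
reg.drop_temp_function("strlen")
print(reg.lookup("strlen"))  # None: dropped without any external call
```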
[jira] [Updated] (SPARK-14256) Remove parameter sqlContext from as.DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar D. Lara Yejas updated SPARK-14256: Description: Currently, the user is required to pass the sqlContext parameter to both createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it should be optional in the signature of as.DataFrame. (was: Currently, the user is required to pass the sqlContext parameter to both createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it should be obviated from the signature of these two methods.) > Remove parameter sqlContext from as.DataFrame > - > > Key: SPARK-14256 > URL: https://issues.apache.org/jira/browse/SPARK-14256 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas > > Currently, the user is required to pass the sqlContext parameter to both > createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it > should be optional in the signature of as.DataFrame.
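The change requested above — dropping a globally-available context from an API signature — follows a common pattern: the parameter becomes optional and falls back to a process-wide default. A minimal hypothetical sketch in plain Python (not SparkR; every name here is invented for illustration):

```python
# Sketch: make a singleton "context" argument optional by falling back
# to a module-level default. This mirrors the idea behind making
# sqlContext optional in as.DataFrame; it is not SparkR code.

_default_context = None

def set_default_context(ctx):
    """Register the process-wide default context (stand-in for sqlContext)."""
    global _default_context
    _default_context = ctx

def as_data_frame(data, context=None):
    """Build a toy data frame; 'context' may be omitted once a default exists."""
    ctx = context if context is not None else _default_context
    if ctx is None:
        raise ValueError("no context supplied and no default registered")
    return {"context": ctx, "data": list(data)}
```

Callers that already hold an explicit context can still pass it, so the change is backward compatible.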
[jira] [Created] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame
Shixiong Zhu created SPARK-14257: Summary: Allow multiple continuous queries to be started from the same DataFrame Key: SPARK-14257 URL: https://issues.apache.org/jira/browse/SPARK-14257 Project: Spark Issue Type: Improvement Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now DataFrame stores the Source in StreamingRelation, which prevents reusing it across multiple continuous queries.
[jira] [Updated] (SPARK-14256) Remove parameter sqlContext from as.DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar D. Lara Yejas updated SPARK-14256: Summary: Remove parameter sqlContext from as.DataFrame (was: Remove parameter sqlContext from as.DataFrame and createDataFrame) > Remove parameter sqlContext from as.DataFrame > - > > Key: SPARK-14256 > URL: https://issues.apache.org/jira/browse/SPARK-14256 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas > > Currently, the user is required to pass the sqlContext parameter to both > createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it > should be obviated from the signature of these two methods.
[jira] [Commented] (SPARK-14253) Avoid registering temporary functions in Hive
[ https://issues.apache.org/jira/browse/SPARK-14253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217044#comment-15217044 ] Liang-Chi Hsieh commented on SPARK-14253: - I already added this support in https://github.com/apache/spark/pull/12036. > Avoid registering temporary functions in Hive > - > > Key: SPARK-14253 > URL: https://issues.apache.org/jira/browse/SPARK-14253 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > Spark should just handle all temporary functions itself instead of passing > them to Hive, which is what the HiveFunctionRegistry does at the moment. The > extra call to Hive is unnecessary and potentially slow, and it makes the > semantics of a "temporary function" confusing.
[jira] [Updated] (SPARK-14256) Remove parameter sqlContext from as.DataFrame and createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar D. Lara Yejas updated SPARK-14256: Description: Currently, the user is required to pass the sqlContext parameter to both createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it should be obviated from the signature of these two methods. > Remove parameter sqlContext from as.DataFrame and createDataFrame > - > > Key: SPARK-14256 > URL: https://issues.apache.org/jira/browse/SPARK-14256 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas > > Currently, the user is required to pass the sqlContext parameter to both > createDataFrame and as.DataFrame. Since sqlContext is a global singleton, it > should be obviated from the signature of these two methods.
[jira] [Commented] (SPARK-14256) Remove parameter sqlContext from as.DataFrame and createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-14256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217018#comment-15217018 ] Oscar D. Lara Yejas commented on SPARK-14256: - I'm working on this one. > Remove parameter sqlContext from as.DataFrame and createDataFrame > - > > Key: SPARK-14256 > URL: https://issues.apache.org/jira/browse/SPARK-14256 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas >
[jira] [Created] (SPARK-14256) Remove parameter sqlContext from as.DataFrame and createDataFrame
Oscar D. Lara Yejas created SPARK-14256: --- Summary: Remove parameter sqlContext from as.DataFrame and createDataFrame Key: SPARK-14256 URL: https://issues.apache.org/jira/browse/SPARK-14256 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Oscar D. Lara Yejas
[jira] [Commented] (SPARK-14255) Streaming Aggregation
[ https://issues.apache.org/jira/browse/SPARK-14255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217002#comment-15217002 ] Apache Spark commented on SPARK-14255: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/12048 > Streaming Aggregation > - > > Key: SPARK-14255 > URL: https://issues.apache.org/jira/browse/SPARK-14255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >
[jira] [Assigned] (SPARK-14255) Streaming Aggregation
[ https://issues.apache.org/jira/browse/SPARK-14255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14255: Assignee: Michael Armbrust (was: Apache Spark) > Streaming Aggregation > - > > Key: SPARK-14255 > URL: https://issues.apache.org/jira/browse/SPARK-14255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >
[jira] [Assigned] (SPARK-14255) Streaming Aggregation
[ https://issues.apache.org/jira/browse/SPARK-14255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14255: Assignee: Apache Spark (was: Michael Armbrust) > Streaming Aggregation > - > > Key: SPARK-14255 > URL: https://issues.apache.org/jira/browse/SPARK-14255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark >
[jira] [Commented] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216999#comment-15216999 ] Seth Hendrickson commented on SPARK-12326: -- [~josephkb] When moving the helper classes to ML, I agree it will be good to make classes private where possible. However, I am not sure what you mean by "change the APIs." Also, could you give an example of what you had in mind as far as eliminating duplicate data stored in the final model? Thanks! > Move GBT implementation from spark.mllib to spark.ml > > > Key: SPARK-12326 > URL: https://issues.apache.org/jira/browse/SPARK-12326 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > Several improvements can be made to gradient boosted trees, but they are not > possible without moving the GBT implementation to spark.ml (e.g. > rawPrediction column, feature importance). This Jira is for moving the > current GBT implementation to spark.ml, which will have roughly the following > steps: > 1. Copy the implementation to spark.ml and change spark.ml classes to use > that implementation. Current tests will ensure that the implementations learn > exactly the same models. > 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, > InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since > eventually all tree implementations will reside in spark.ml, the helper > classes should as well. > 3. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence.
[jira] [Resolved] (SPARK-14215) Support chained Python UDF
[ https://issues.apache.org/jira/browse/SPARK-14215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-14215. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12014 [https://github.com/apache/spark/pull/12014] > Support chained Python UDF > -- > > Key: SPARK-14215 > URL: https://issues.apache.org/jira/browse/SPARK-14215 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > pyudf(pyudf(xx))
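The "pyudf(pyudf(xx))" shorthand above means feeding one Python UDF's output directly into another. As an illustration of the chaining idea only — plain Python, not PySpark's actual implementation, with all names invented — two tagged functions can be fused so the chain evaluates in a single pass:

```python
# Toy model of chained Python UDFs. In Spark, evaluating pyudf(pyudf(x))
# naively would cost two executor<->Python round-trips; fusing the chain
# into one callable lets it be evaluated in a single pass.

def udf(f):
    """Stand-in for a UDF registration step: just tags the function."""
    f.is_udf = True
    return f

def fuse(outer, inner):
    """Fuse outer(inner(x)) into one UDF-tagged callable."""
    def fused(x):
        return outer(inner(x))
    return udf(fused)

inc = udf(lambda x: x + 1)
double = udf(lambda x: x * 2)
double_then_inc = fuse(inc, double)  # behaves like inc(double(x))
```

The fused callable is itself a "UDF", so longer chains can be folded the same way, pairwise.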
[jira] [Created] (SPARK-14255) Streaming Aggregation
Michael Armbrust created SPARK-14255: Summary: Streaming Aggregation Key: SPARK-14255 URL: https://issues.apache.org/jira/browse/SPARK-14255 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust
[jira] [Assigned] (SPARK-14224) Cannot project all columns from a table with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14224: Assignee: (was: Apache Spark) > Cannot project all columns from a table with ~1,100 columns > --- > > Key: SPARK-14224 > URL: https://issues.apache.org/jira/browse/SPARK-14224 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > I created a temporary table from the 1000 genomes dataset and cached it. When I > try to select all columns for even a single row, I get the following exception. > Setting and unsetting {{spark.sql.codegen.wholeStage}} and > {{spark.sql.codegen}} have no effect. > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > vcfData.registerTempTable("genomesTable") > %sql select * from genomesTable > Error in SQL statement: RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > /* 001 */ > /* 002 */ public java.lang.Object generate(Object[] references) { > /* 003 */ return new SpecificSafeProjection(references); > {code} > cc [~rxin]
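The "grows beyond 64 KB" failure above stems from the JVM's limit on the bytecode size of a single method: emitting one flat generated projection body for ~1,100 columns exceeds it. A generic remedy is to split the generated code into bounded-size helper functions plus a small driver. The sketch below illustrates that splitting strategy in plain Python; the chunk size and function names are invented and do not reflect Spark's actual code generator:

```python
# Sketch of the "split generated code into bounded-size helpers" remedy
# for per-method code-size limits. Instead of one giant function body
# copying N columns, emit helpers of at most `chunk` assignments each
# and a driver that calls them in order.

def generate_projection(n_cols, chunk=300):
    helpers = []
    for start in range(0, n_cols, chunk):
        stop = min(start + chunk, n_cols)
        body = "".join(f"    out[{i}] = row[{i}]\n" for i in range(start, stop))
        helpers.append(f"def apply_{start}(row, out):\n{body}")
    calls = "".join(f"    apply_{s}(row, out)\n" for s in range(0, n_cols, chunk))
    driver = (f"def project(row):\n"
              f"    out = [None] * {n_cols}\n"
              f"{calls}"
              f"    return out\n")
    return "".join(helpers) + driver

# Compile a projection for 1,100 columns, the width from the bug report.
ns = {}
exec(generate_projection(1100), ns)
```

Each generated helper stays well under any fixed size cap, no matter how many columns the schema has.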
[jira] [Assigned] (SPARK-14223) Cannot project all columns from a parquet file with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14223: Assignee: Apache Spark > Cannot project all columns from a parquet file with ~1,100 columns > --- > > Key: SPARK-14223 > URL: https://issues.apache.org/jira/browse/SPARK-14223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Assignee: Apache Spark >Priority: Critical > Attachments: 1000genomes.gz.parquet > > > The parquet file is generated by saving the first 10 rows of the 1000 genomes > dataset. When I try to run "select * from" a temp table created from the file > I get the following error: > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > data.registerTempTable("genomesTable") > sql("select * from genomesTable limit > 10").write.format("parquet").save("/tmp/test-genome.parquet") > sqlContext.read.format("parquet").load("/tmp/test-genome.parquet").registerTempTable("parquetTable") > %sql select * from parquetTable > SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 31.0 (TID 4695, > ip-10-0-251-144.us-west-2.compute.internal): java.lang.ClassCastException: > org.apache.spark.sql.execution.vectorized.ColumnarBatch cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:237) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:230) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:82) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > cc [~rxin]
[jira] [Commented] (SPARK-14224) Cannot project all columns from a table with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216909#comment-15216909 ] Apache Spark commented on SPARK-14224: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12047 > Cannot project all columns from a table with ~1,100 columns > --- > > Key: SPARK-14224 > URL: https://issues.apache.org/jira/browse/SPARK-14224 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > I created a temporary table from the 1000 genomes dataset and cached it. When I > try to select all columns for even a single row, I get the following exception. > Setting and unsetting {{spark.sql.codegen.wholeStage}} and > {{spark.sql.codegen}} have no effect. > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > vcfData.registerTempTable("genomesTable") > %sql select * from genomesTable > Error in SQL statement: RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > /* 001 */ > /* 002 */ public java.lang.Object generate(Object[] references) { > /* 003 */ return new SpecificSafeProjection(references); > {code} > cc [~rxin]
[jira] [Assigned] (SPARK-14223) Cannot project all columns from a parquet file with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14223: Assignee: (was: Apache Spark) > Cannot project all columns from a parquet file with ~1,100 columns > --- > > Key: SPARK-14223 > URL: https://issues.apache.org/jira/browse/SPARK-14223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Priority: Critical > Attachments: 1000genomes.gz.parquet > > > The parquet file is generated by saving the first 10 rows of the 1000 genomes > dataset. When I try to run "select * from" a temp table created from the file > I get the following error: > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > data.registerTempTable("genomesTable") > sql("select * from genomesTable limit > 10").write.format("parquet").save("/tmp/test-genome.parquet") > sqlContext.read.format("parquet").load("/tmp/test-genome.parquet").registerTempTable("parquetTable") > %sql select * from parquetTable > SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 31.0 (TID 4695, > ip-10-0-251-144.us-west-2.compute.internal): java.lang.ClassCastException: > org.apache.spark.sql.execution.vectorized.ColumnarBatch cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:237) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:230) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:82) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > cc [~rxin]
[jira] [Assigned] (SPARK-14224) Cannot project all columns from a table with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14224: Assignee: Apache Spark > Cannot project all columns from a table with ~1,100 columns > --- > > Key: SPARK-14224 > URL: https://issues.apache.org/jira/browse/SPARK-14224 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Assignee: Apache Spark > > I created a temporary table from the 1000 genomes dataset and cached it. When I > try to select all columns for even a single row, I get the following exception. > Setting and unsetting {{spark.sql.codegen.wholeStage}} and > {{spark.sql.codegen}} have no effect. > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > vcfData.registerTempTable("genomesTable") > %sql select * from genomesTable > Error in SQL statement: RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" > grows beyond 64 KB > /* 001 */ > /* 002 */ public java.lang.Object generate(Object[] references) { > /* 003 */ return new SpecificSafeProjection(references); > {code} > cc [~rxin]
[jira] [Commented] (SPARK-14223) Cannot project all columns from a parquet file with ~1,100 columns
[ https://issues.apache.org/jira/browse/SPARK-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216910#comment-15216910 ] Apache Spark commented on SPARK-14223: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12047 > Cannot project all columns from a parquet file with ~1,100 columns > --- > > Key: SPARK-14223 > URL: https://issues.apache.org/jira/browse/SPARK-14223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >Priority: Critical > Attachments: 1000genomes.gz.parquet > > > The parquet file is generated by saving the first 10 rows of the 1000 genomes > dataset. When I try to run "select * from" a temp table created from the file > I get the following error: > {code} > val vcfData = sqlContext.read.format("csv").options(Map( > "comment" -> "#", "header" -> "false", "delimiter" -> "\t" > )).load("/1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf") > data.registerTempTable("genomesTable") > sql("select * from genomesTable limit > 10").write.format("parquet").save("/tmp/test-genome.parquet") > sqlContext.read.format("parquet").load("/tmp/test-genome.parquet").registerTempTable("parquetTable") > %sql select * from parquetTable > SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 31.0 (TID 4695, > ip-10-0-251-144.us-west-2.compute.internal): java.lang.ClassCastException: > org.apache.spark.sql.execution.vectorized.ColumnarBatch cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:237) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:230) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:772) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:82) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > cc [~rxin]
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216899#comment-15216899 ] Miles Crawford commented on SPARK-14209: That's fantastic news! Let me know if I can help test a fix. > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand?
[jira] [Assigned] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14254: Assignee: Apache Spark > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Apache Spark > > It would be very helpful for network performance investigation if we log the > time spent on connecting and resolving host.
[jira] [Assigned] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14254: Assignee: (was: Apache Spark) > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > It would be very helpful for network performance investigation if we log the > time spent on connecting and resolving host.
[jira] [Commented] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216893#comment-15216893 ] Apache Spark commented on SPARK-14254: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/12046 > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > It would be very helpful for network performance investigation if we log the > time spent on connecting and resolving host.
[jira] [Updated] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-14254: - Description: It would be very helpful for network performance investigation if we log the time spent on connecting and resolving host. (was: It would be very helpful for if we log the time spent on connecting and resolving host.) > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > It would be very helpful for network performance investigation if we log the > time spent on connecting and resolving host.
[jira] [Updated] (SPARK-14254) Add logs to help investigate the network performance
[ https://issues.apache.org/jira/browse/SPARK-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-14254: - Description: It would be very helpful for if we log the time spent on connecting and resolving host. (was: It would be very helpful if we log the time spent on connecting and resolving host.) > Add logs to help investigate the network performance > > > Key: SPARK-14254 > URL: https://issues.apache.org/jira/browse/SPARK-14254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > It would be very helpful for if we log the time spent on connecting and > resolving host.
[jira] [Created] (SPARK-14254) Add logs to help investigate the network performance
Shixiong Zhu created SPARK-14254: Summary: Add logs to help investigate the network performance Key: SPARK-14254 URL: https://issues.apache.org/jira/browse/SPARK-14254 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu It would be very helpful if we log the time spent on connecting and resolving host.
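The instrumentation proposed above boils down to timing host resolution and connection setup separately and logging each. A minimal generic sketch in plain Python (illustrative only — the real change lives in Spark's transport code, and the `timed` helper here is hypothetical):

```python
import socket
import time


def timed(label, fn, *args):
    """Run fn(*args), print how long it took, and return the result."""
    start = time.monotonic()
    try:
        return fn(*args)
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        print("%s took %.1f ms" % (label, elapsed_ms))


# Timing resolution separately from connecting makes a slow DNS lookup
# visible in the logs, which is what the issue asks for.
addrs = timed("resolving localhost", socket.getaddrinfo, "localhost", 80)
```

Wrapping each phase independently is what makes it possible to tell a slow lookup from a slow TCP connect when reading the logs afterwards.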
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216888#comment-15216888 ] Marcelo Vanzin commented on SPARK-14209: 2.0 doesn't currently suffer from this problem because, instead, it suffers from SPARK-14252. :-p It shouldn't be hard to provide a fix for 1.6, though. > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand? 
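Until the underlying block-recovery behavior is fixed, the window for this failure can sometimes be narrowed by raising retry-related settings. A hedged mitigation sketch — the values are illustrative, and this is not a fix for the bug itself:

```properties
# spark-defaults.conf — illustrative values only
spark.task.maxFailures      8
spark.shuffle.io.maxRetries 10
spark.shuffle.io.retryWait  15s
```

Raising the retry budget only buys time for executors to come back; blocks held solely by a preempted container are still lost.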
[jira] [Commented] (SPARK-14252) Executors do not try to download remote cached blocks
[ https://issues.apache.org/jira/browse/SPARK-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216886#comment-15216886 ] Marcelo Vanzin commented on SPARK-14252: Here's a slightly updated piece of code where it's easier to show the problem:
{code}
val sc = new SparkContext(conf)
try {
  val mapCount = sc.accumulator(0L)
  val rdd = sc.parallelize(1 to 1000, 10).map { i =>
    mapCount += 1
    i
  }.cache()
  rdd.count()
  println(s"Map count after first count: $mapCount")

  // Create a single task that will sleep and block, so that a particular executor is busy.
  // This should force future tasks to download cached data from that executor.
  println("Running sleep job..")
  val thread = new Thread(new Runnable() {
    override def run(): Unit = {
      rdd.mapPartitionsWithIndex { (i, iter) =>
        if (i == 0) {
          Thread.sleep(TimeUnit.MINUTES.toMillis(10))
        }
        iter
      }.count()
    }
  })
  thread.setDaemon(true)
  thread.start()

  // Wait a few seconds to make sure the task is running (too lazy for listeners)
  println("Waiting for tasks to start...")
  TimeUnit.SECONDS.sleep(10)

  // Now run a job that will touch everything and should use the cached data.
  val cnt = rdd.map(_*2).count()
  println(s"Counted $cnt elements.")

  println("Killing sleep job.")
  thread.interrupt()
  thread.join()

  println(s"Map count after all tasks finished: $mapCount")
} finally {
  sc.stop()
}
{code}
On 1.6 I get:
{noformat}
Map count after first count: 1000
Map count after all tasks finished: 1000
{noformat}
On 2.0 I get:
{noformat}
Map count after first count: 1000
Map count after all tasks finished: 1500
{noformat}
> Executors do not try to download remote cached blocks > - > > Key: SPARK-14252 > URL: https://issues.apache.org/jira/browse/SPARK-14252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > I noticed this when taking a look at the root cause of SPARK-14209. 2.0.0 > includes SPARK-12817, which changed the caching code a bit to remove > duplication.
But it seems to have removed the part where executors check > whether other executors contain the needed cached block. > In 1.6, that was done by the call to {{BlockManager.get}} in > {{CacheManager.getOrCompute}}. But in the new version, {{RDD.iterator}} calls > {{BlockManager.getOrElseUpdate}}, which never calls {{BlockManager.get}}, and > thus the executor never gets block that are cached by other executors, > causing the blocks to be instead recomputed locally. > I wrote a small program that shows this. In 1.6, running with > {{--num-executors 2}}, I get 5 blocks cached on each executor, and messages > like these in the logs: > {noformat} > 16/03/29 13:18:01 DEBUG spark.CacheManager: Looking for partition rdd_0_7 > 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting local block rdd_0_7 > 16/03/29 13:18:01 DEBUG storage.BlockManager: Block rdd_0_7 not registered > locally > 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 > 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 > from BlockManagerId(1, blah, 58831) > 1 > {noformat} > On 2.0, I get (almost) all the 10 partitions cached on *both* executors, > because once the second one fails to find a block locally it just recomputes > it and caches it. It never tries to download the block from the other > executor. The log messages above, which still exist in the code, don't show > up anywhere. > Here's the code I used for the above (trimmed of some other stuff from my > little test harness, so might not compile as is): > {code} > val sc = new SparkContext(conf) > try { > val rdd = sc.parallelize(1 to 1000, 10) > rdd.cache() > rdd.count() > // Create a single task that will sleep and block, so that a particular > executor is busy. > // This should force future tasks to download cached data from that > executor. 
> println("Running sleep job..") > val thread = new Thread(new Runnable() { > override def run(): Unit = { > rdd.mapPartitionsWithIndex { (i, iter) => > if (i == 0) { > Thread.sleep(TimeUnit.MINUTES.toMillis(10)) > } > iter > }.count() > } > }) > thread.setDaemon(true) > thread.start() > // Wait a few seconds to make sure the task is running (too lazy for > listeners) > println("Waiting for tasks to start...") > TimeUnit.SECONDS.sleep(10) > // Now run a job that will touch
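The lookup-order difference described in this thread (1.6: local cache, then other executors, then recompute; 2.0: the remote step is skipped) can be sketched with a toy block manager. Plain Python and purely schematic — none of these names are Spark's actual internals:

```python
def get_or_compute_16(local, remote, key, compute):
    """1.6-style lookup (CacheManager.getOrCompute): local, then remote, then recompute."""
    if key in local:
        return local[key]
    if key in remote:              # the remote check that 2.0 lost
        local[key] = remote[key]   # fetch the block from the other executor
        return local[key]
    local[key] = compute()
    return local[key]


def get_or_else_update_20(local, remote, key, compute):
    """2.0-style lookup (BlockManager.getOrElseUpdate) as described in the report."""
    if key in local:
        return local[key]
    local[key] = compute()         # recomputes even though another executor has the block
    return local[key]


recomputations = []
other_executor = {"rdd_0_7": 42}   # block cached on another executor

get_or_compute_16({}, other_executor, "rdd_0_7", lambda: recomputations.append(1) or 42)
get_or_else_update_20({}, other_executor, "rdd_0_7", lambda: recomputations.append(1) or 42)
print("recomputations:", len(recomputations))   # prints: recomputations: 1
```

Only the 2.0-style path recomputes, which matches the 1500-vs-1000 accumulator counts in the comment above.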
[jira] [Created] (SPARK-14253) Avoid registering temporary functions in Hive
Andrew Or created SPARK-14253: - Summary: Avoid registering temporary functions in Hive Key: SPARK-14253 URL: https://issues.apache.org/jira/browse/SPARK-14253 Project: Spark Issue Type: Bug Components: SQL Reporter: Andrew Or Assignee: Andrew Or Spark should just handle all temporary functions itself instead of passing them to Hive, which is what the HiveFunctionRegistry does at the moment. The extra call to Hive is unnecessary and potentially slow, and it makes the semantics of a "temporary function" confusing.
[jira] [Created] (SPARK-14252) Executors do not try to download remote cached blocks
Marcelo Vanzin created SPARK-14252: -- Summary: Executors do not try to download remote cached blocks Key: SPARK-14252 URL: https://issues.apache.org/jira/browse/SPARK-14252 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Marcelo Vanzin I noticed this when taking a look at the root cause of SPARK-14209. 2.0.0 includes SPARK-12817, which changed the caching code a bit to remove duplication. But it seems to have removed the part where executors check whether other executors contain the needed cached block. In 1.6, that was done by the call to {{BlockManager.get}} in {{CacheManager.getOrCompute}}. But in the new version, {{RDD.iterator}} calls {{BlockManager.getOrElseUpdate}}, which never calls {{BlockManager.get}}, and thus the executor never gets blocks that are cached by other executors, causing the blocks to be recomputed locally instead. I wrote a small program that shows this. In 1.6, running with {{--num-executors 2}}, I get 5 blocks cached on each executor, and messages like these in the logs:
{noformat}
16/03/29 13:18:01 DEBUG spark.CacheManager: Looking for partition rdd_0_7
16/03/29 13:18:01 DEBUG storage.BlockManager: Getting local block rdd_0_7
16/03/29 13:18:01 DEBUG storage.BlockManager: Block rdd_0_7 not registered locally
16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7
16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 from BlockManagerId(1, blah, 58831)
{noformat}
On 2.0, I get (almost) all the 10 partitions cached on *both* executors, because once the second one fails to find a block locally it just recomputes it and caches it. It never tries to download the block from the other executor. The log messages above, which still exist in the code, don't show up anywhere.
Here's the code I used for the above (trimmed of some other stuff from my little test harness, so might not compile as is):
{code}
val sc = new SparkContext(conf)
try {
  val rdd = sc.parallelize(1 to 1000, 10)
  rdd.cache()
  rdd.count()

  // Create a single task that will sleep and block, so that a particular executor is busy.
  // This should force future tasks to download cached data from that executor.
  println("Running sleep job..")
  val thread = new Thread(new Runnable() {
    override def run(): Unit = {
      rdd.mapPartitionsWithIndex { (i, iter) =>
        if (i == 0) {
          Thread.sleep(TimeUnit.MINUTES.toMillis(10))
        }
        iter
      }.count()
    }
  })
  thread.setDaemon(true)
  thread.start()

  // Wait a few seconds to make sure the task is running (too lazy for listeners)
  println("Waiting for tasks to start...")
  TimeUnit.SECONDS.sleep(10)

  // Now run a job that will touch everything and should use the cached data.
  val cnt = rdd.map(_*2).count()
  println(s"Counted $cnt elements.")

  println("Killing sleep job.")
  thread.interrupt()
  thread.join()
} finally {
  sc.stop()
}
{code}
[jira] [Resolved] (SPARK-14227) Add method for printing out generated code for debugging
[ https://issues.apache.org/jira/browse/SPARK-14227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14227. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.0.0 > Add method for printing out generated code for debugging > > > Key: SPARK-14227 > URL: https://issues.apache.org/jira/browse/SPARK-14227 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.0.0 > > > Currently it's hard to get at the output of WholeStageCodegen. It would be > nice to add debug support for the generated code. > cc [~rxin]
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216769#comment-15216769 ] Michael Zieliński commented on SPARK-7146: -- We ended up copying a significant portion of Params (HasInputCol(s), HasOutputCol(s), HasProbabilityCol) for our custom Transformers/Estimators. This creates a problem if you want to abstract over Transformers, e.g. have a piece of code that accepts any transformer with a "setInputCol" method. Because our HasInputCol and the native Spark HasInputCol are different traits, the result is very clunky code (structural typing plus a few ugly type aliases). Opening up at least some traits would result in cleaner code and more reuse. This document (https://docs.google.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#) separates the traits into a few groups, like I/O, hyper-parameters, and optimization. I/O would be a very good place to start, since those are re-used most frequently across algorithms. > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Discussion: Should the Param traits in sharedParams.scala be public? > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private.
> Proposal: Either > (a) make the shared params private to encourage users to write specialized > documentation and value checks for parameters, or > (b) design a better way to encourage overriding documentation and parameter > value checks
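The reuse problem described in the comment — third-party code duplicating HasInputCol and then failing to interoperate with Spark's copy — can be shown with a toy shared-param mixin. A hedged Python sketch (spark.ml's real traits are Scala; every name here is illustrative):

```python
class HasInputCol:
    """Toy analogue of spark.ml's (private) HasInputCol shared-param trait."""

    def __init__(self):
        self.input_col = None

    def set_input_col(self, name):
        self.input_col = name
        return self


class Tokenizer(HasInputCol):
    pass


class Binarizer(HasInputCol):
    pass


def configure(transformer, col):
    """Generic code that works for any transformer sharing the mixin.

    If the shared trait stays private, third parties must copy it, and a
    helper like this can no longer cover both copies without structural
    typing tricks -- the clunkiness described in the comment above."""
    return transformer.set_input_col(col)


print(configure(Tokenizer(), "text").input_col)   # prints: text
```

One shared base type is what lets `configure` accept both `Tokenizer` and `Binarizer`; duplicated traits break exactly this.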
[jira] [Updated] (SPARK-14229) PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat
[ https://issues.apache.org/jira/browse/SPARK-14229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Jurney updated SPARK-14229: --- Shepherd: Matei Zaharia > PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat > -- > > Key: SPARK-14229 > URL: https://issues.apache.org/jira/browse/SPARK-14229 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark, Spark Shell >Affects Versions: 1.6.1 >Reporter: Russell Jurney > > I am able to save data to MongoDB from any RDD... provided that RDD does not > belong to a DataFrame. If I use DataFrame.rdd, it is not possible to save via > saveAsNewAPIHadoopFile whatsoever. I have tested that this applies to saving > to MongoDB, BSON Files, and ElasticSearch. > I get the following error when I try to save to a HadoopFile: > config = {"mongo.output.uri": > "mongodb://localhost:27017/agile_data_science.on_time_performance"} > n [3]: on_time_dataframe.rdd.saveAsNewAPIHadoopFile( >...: path='file://unused', >...: outputFormatClass='com.mongodb.hadoop.MongoOutputFormat', >...: keyClass='org.apache.hadoop.io.Text', >...: valueClass='org.apache.hadoop.io.MapWritable', >...: conf=config >...: ) > 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_1 stored as > values in memory (estimated size 62.7 KB, free 147.3 KB) > 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_1_piece0 stored > as bytes in memory (estimated size 20.4 KB, free 167.7 KB) > 16/03/28 19:59:57 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in > memory on localhost:61301 (size: 20.4 KB, free: 511.1 MB) > 16/03/28 19:59:57 INFO spark.SparkContext: Created broadcast 1 from > javaToPython at NativeMethodAccessorImpl.java:-2 > 16/03/28 19:59:57 INFO Configuration.deprecation: mapred.min.split.size is > deprecated. 
Instead, use mapreduce.input.fileinputformat.split.minsize > 16/03/28 19:59:57 INFO parquet.ParquetRelation: Reading Parquet file(s) from > file:/Users/rjurney/Software/Agile_Data_Code_2/data/on_time_performance.parquet/part-r-0-32089f1b-5447-4a75-b008-4fd0a0a8b846.gz.parquet > 16/03/28 19:59:57 INFO spark.SparkContext: Starting job: take at > SerDeUtil.scala:231 > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Got job 1 (take at > SerDeUtil.scala:231) with 1 output partitions > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 > (take at SerDeUtil.scala:231) > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Parents of final stage: List() > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Missing parents: List() > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Submitting ResultStage 1 > (MapPartitionsRDD[6] at mapPartitions at SerDeUtil.scala:146), which has no > missing parents > 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_2 stored as > values in memory (estimated size 14.9 KB, free 182.6 KB) > 16/03/28 19:59:57 INFO storage.MemoryStore: Block broadcast_2_piece0 stored > as bytes in memory (estimated size 7.5 KB, free 190.1 KB) > 16/03/28 19:59:57 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on localhost:61301 (size: 7.5 KB, free: 511.1 MB) > 16/03/28 19:59:57 INFO spark.SparkContext: Created broadcast 2 from broadcast > at DAGScheduler.scala:1006 > 16/03/28 19:59:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from ResultStage 1 (MapPartitionsRDD[6] at mapPartitions at > SerDeUtil.scala:146) > 16/03/28 19:59:57 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with > 1 tasks > 16/03/28 19:59:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 1.0 (TID 8, localhost, partition 0,PROCESS_LOCAL, 2739 bytes) > 16/03/28 19:59:57 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID > 8) > 16/03/28 19:59:58 INFO > parquet.ParquetRelation$$anonfun$buildInternalScan$1$$anon$1: 
Input split: > ParquetInputSplit{part: > file:/Users/rjurney/Software/Agile_Data_Code_2/data/on_time_performance.parquet/part-r-0-32089f1b-5447-4a75-b008-4fd0a0a8b846.gz.parquet > start: 0 end: 134217728 length: 134217728 hosts: []} > 16/03/28 19:59:59 INFO compress.CodecPool: Got brand-new decompressor [.gz] > 16/03/28 19:59:59 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 8) > net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for pyspark.sql.types._create_row) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) > at
[jira] [Commented] (SPARK-14229) PySpark DataFrame.rdd's can't be saved to an arbitrary Hadoop OutputFormat
[ https://issues.apache.org/jira/browse/SPARK-14229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216759#comment-15216759 ] Russell Jurney commented on SPARK-14229: To reproduce for MongoDB: The data I am having trouble storing is at s3://agile_data_science/on_time_performance.parquet and its ACL is set to public, so you should be able to load it. If you can't reproduce, let me know and I will try to set up S3 access from my machine (I'm working in local mode). My init code:
{code:title=init.py}
import json
import pymongo
import pymongo_spark

pymongo_spark.activate()
{code}
To reproduce a successful loading of this data, as a textFile to an RDD without ever being a DataFrame (note that this generates 20GB of temporary files, be sure you have the space!):
{code:title=success.py}
on_time_lines = sc.textFile("s3://agile_data_science/On_Time_On_Time_Performance_2015.jsonl.gz")
on_time_performance = on_time_lines.map(lambda x: json.loads(x))
on_time_performance.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
{code}
To reproduce a failed loading of this data when saving the RDD of a DataFrame or its descendants:
{code:title=failure.py}
on_time_dataframe = sqlContext.read.parquet('s3://agile_data_science/on_time_performance.parquet')
on_time_dataframe = on_time_dataframe.drop("")  # remove empty field
on_time_dataframe.rdd.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance')
{code}
And with the longer syntax:
{code:title=long_failure.py}
config = {"mongo.output.uri": "mongodb://localhost:27017/agile_data_science.on_time_performance"}
on_time_dataframe.rdd.saveAsNewAPIHadoopFile(
    path='file://unused',
    outputFormatClass='com.mongodb.hadoop.MongoOutputFormat',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.MapWritable',
    conf=config
)
{code}
The documents look like:
{code}
Row(Year=2015, Quarter=1, Month=1, DayofMonth=1, DayOfWeek=4, FlightDate=u'2015-01-01',
UniqueCarrier=u'AA', AirlineID=19805, Carrier=u'AA', TailNum=u'N787AA', FlightNum=1, OriginAirportID=12478, OriginAirportSeqID=1247802, OriginCityMarketID=31703, Origin=u'JFK', OriginCityName=u'New York, NY', OriginState=u'NY', OriginStateFips=36, OriginStateName=u'New York', OriginWac=22, DestAirportID=12892, DestAirportSeqID=1289203, DestCityMarketID=32575, Dest=u'LAX', DestCityName=u'Los Angeles, CA', DestState=u'CA', DestStateFips=6, DestStateName=u'California', DestWac=91, CRSDepTime=900, DepTime=855, DepDelay=-5.0, DepDelayMinutes=0.0, DepDel15=0.0, DepartureDelayGroups=-1, DepTimeBlk=u'0900-0959', TaxiOut=17.0, WheelsOff=912, WheelsOn=1230, TaxiIn=7.0, CRSArrTime=1230, ArrTime=1237, ArrDelay=7.0, ArrDelayMinutes=7.0, ArrDel15=0.0, ArrivalDelayGroups=0, ArrTimeBlk=u'1200-1259', Cancelled=0.0, CancellationCode=u'', Diverted=0.0, CRSElapsedTime=390.0, ActualElapsedTime=402.0, AirTime=378.0, Flights=1.0, Distance=2475.0, DistanceGroup=10, CarrierDelay=None, WeatherDelay=None, NASDelay=None, SecurityDelay=None, LateAircraftDelay=None, FirstDepTime=None, TotalAddGTime=None, LongestAddGTime=None, DivAirportLandings=0, DivReachedDest=None, DivActualElapsedTime=None, DivArrDelay=None, DivDistance=None, Div1Airport=u'', Div1AirportID=None, Div1AirportSeqID=None, Div1WheelsOn=None, Div1TotalGTime=None, Div1LongestGTime=None, Div1WheelsOff=None, Div1TailNum=u'', Div2Airport=u'', Div2AirportID=None, Div2AirportSeqID=None, Div2WheelsOn=None, Div2TotalGTime=None, Div2LongestGTime=None, Div2WheelsOff=None, Div2TailNum=u'', Div3Airport=u'', Div3AirportID=None, Div3AirportSeqID=None, Div3WheelsOn=None, Div3TotalGTime=None, Div3LongestGTime=None, Div3WheelsOff=u'', Div3TailNum=u'', Div4Airport=u'', Div4AirportID=u'', Div4AirportSeqID=u'', Div4WheelsOn=u'', Div4TotalGTime=u'', Div4LongestGTime=u'', Div4WheelsOff=u'', Div4TailNum=u'', Div5Airport=u'', Div5AirportID=u'', Div5AirportSeqID=u'', Div5WheelsOn=u'', Div5TotalGTime=u'', Div5LongestGTime=u'', Div5WheelsOff=u'', 
Div5TailNum=u'') {code} As JSON: {code:title=data.json} {"Year":2015,"Quarter":1,"Month":1,"DayofMonth":10,"DayOfWeek":6,"FlightDate":"2015-01-10","UniqueCarrier":"AA","AirlineID":19805,"Carrier":"AA","TailNum":"N790AA","FlightNum":1,"OriginAirportID":12478,"OriginAirportSeqID":1247802,"OriginCityMarketID":31703,"Origin":"JFK","OriginCityName":"New York, NY","OriginState":"NY","OriginStateFips":36,"OriginStateName":"New York","OriginWac":22,"DestAirportID":12892,"DestAirportSeqID":1289203,"DestCityMarketID":32575,"Dest":"LAX","DestCityName":"Los Angeles,
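The {{net.razorvine.pickle.PickleException}} in the report is triggered by {{Row}} objects reaching the Java-side unpickler. A commonly suggested workaround is to map each Row to a plain dict (PySpark's {{Row.asDict()}}) before saving. The shape of that transformation, simulated here with a stand-in class since PySpark is not imported (hedged — not verified against this exact dataset):

```python
class FakeRow:
    """Stand-in for pyspark.sql.Row, which also exposes asDict()."""

    def __init__(self, **fields):
        self._fields = fields

    def asDict(self):
        return dict(self._fields)


def rows_to_dicts(rdd_like):
    # In real PySpark this would be something like:
    #   on_time_dataframe.rdd.map(lambda row: row.asDict()).saveToMongoDB(...)
    # so the pickler sees plain dicts instead of Row objects.
    return [row.asDict() for row in rdd_like]


rows = [FakeRow(Origin="JFK", Dest="LAX")]
print(rows_to_dicts(rows))   # prints: [{'Origin': 'JFK', 'Dest': 'LAX'}]
```

The point of the `map` is that nothing of type `pyspark.sql.types._create_row` survives into the saved RDD.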
[jira] [Commented] (SPARK-14033) Merging Estimator & Model
[ https://issues.apache.org/jira/browse/SPARK-14033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216756#comment-15216756 ] Michael Zieliński commented on SPARK-14033: --- Re: ML vs MLlib, I also think about it in terms of RDDs versus DataFrames. Re: Estimator/Model, I prefer the current version that preserves immutability to a larger degree. That said, maybe merging those concepts would make it easier for the next stage of a Pipeline to use outputs from the previous stage. Currently if you have:
val a1 = new Estimator1
val a2 = new Estimator2().setParamAbc(a1.getParamCde)
you can only get the members from Estimator1, but not Estimator1Model. If they're the same class it would make things easier. As an example, you might want to take the top K variables from a Random Forest model as input to Logistic Regression. > Merging Estimator & Model > - > > Key: SPARK-14033 > URL: https://issues.apache.org/jira/browse/SPARK-14033 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Attachments: StyleMutabilityMergingEstimatorandModel.pdf > > > This JIRA is for merging the spark.ml concepts of Estimator and Model. > Goal: Have clearer semantics which match existing libraries (such as > scikit-learn). > For details, please see the linked design doc. Comment on this JIRA to give > feedback on the proposed design. Once the proposal is discussed and this > work is confirmed as ready to proceed, this JIRA will serve as an umbrella > for the merge tasks.
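The friction described in the comment — a later pipeline stage wanting values computed by an earlier stage's fitted model — can be sketched as a toy Estimator/Model pair. Purely illustrative Python; spark.ml's actual classes differ:

```python
class Model:
    """Toy fitted model holding values the next pipeline stage may want."""

    def __init__(self, k, top_features):
        self.k = k
        self.top_features = top_features


class Estimator:
    """Toy version of the current split design: fit() returns a separate Model."""

    def __init__(self, k=10):
        self.k = k

    def fit(self, data):
        return Model(self.k, top_features=sorted(data)[: self.k])


# Under the split design, downstream code can read params off the Estimator
# but must thread the fitted Model around explicitly to reach top_features;
# a merged Estimator/Model would expose both on one object.
est = Estimator(k=2)
model = est.fit([5, 3, 1, 4])
print(model.top_features)   # prints: [1, 3]
```

This mirrors the Random Forest example: the "top K variables" live on the fitted model, not on the estimator the pipeline was configured with.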
[jira] [Commented] (SPARK-14248) Get the path hierarchy from root to leaf in the BisectingKMeansModel
[ https://issues.apache.org/jira/browse/SPARK-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216747#comment-15216747 ] Lakesh commented on SPARK-14248: [~josephkb] It would be a very small addition. You can assign it to me if appropriate. Thanks > Get the path hierarchy from root to leaf in the BisectingKMeansModel > > > Key: SPARK-14248 > URL: https://issues.apache.org/jira/browse/SPARK-14248 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 >Reporter: Lakesh >Priority: Trivial > Labels: newbie > > I was using the BisectingKMeansModel of MLlib in Spark. However, I couldn't > find an existing method in the BisectingKMeansModel to return me the path > hierarchy. So I added a small function that would return me the prediction > path itself. I was thinking this could be useful for people with a similar > use case.
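The "prediction path" the reporter describes — the chain of cluster indices visited from the root down to the leaf a point falls in — can be sketched for a binary clustering tree. A schematic Python sketch (1-D points and a hypothetical node layout; not the actual BisectingKMeansModel internals):

```python
class Node:
    """A node in a toy binary clustering tree (1-D centers for brevity)."""

    def __init__(self, index, center, left=None, right=None):
        self.index, self.center = index, center
        self.left, self.right = left, right


def prediction_path(root, point):
    """Return the node indices visited from the root to the leaf the point
    lands in, descending into whichever child's center is closer."""
    path, node = [], root
    while node is not None:
        path.append(node.index)
        if node.left is None:      # leaf reached
            break
        closer_left = abs(point - node.left.center) <= abs(point - node.right.center)
        node = node.left if closer_left else node.right
    return path


tree = Node(0, 5.0,
            left=Node(1, 2.0, left=Node(3, 1.0), right=Node(4, 3.0)),
            right=Node(2, 8.0))
print(prediction_path(tree, 0.5))   # prints: [0, 1, 3]
```

The returned list is exactly the root-to-leaf hierarchy the issue asks for, rather than just the final leaf index that `predict` gives.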
[jira] [Commented] (SPARK-14246) vars not updated after Scala script reload
[ https://issues.apache.org/jira/browse/SPARK-14246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216734#comment-15216734 ] Jim Powers commented on SPARK-14246: It appears that this is a Scala problem and not a spark one. {noformat} scala -Yrepl-class-based Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74). Type in expressions for evaluation. Or try :help. scala> :load Fail.scala Loading Fail.scala... X: Serializable{def getArray(n: Int): Array[Double]; def multiplySum(x: Double,v: Seq[Double]): Double} = $anon$1@5ddf0d24 scala> val a = X.getArray(100) warning: there was one feature warning; re-run with -feature for details a: Array[Double] = Array(0.6445967025789217, 0.8356433165638456, 0.9050186287112574, 0.06344554850936357, 0.008363070900988756, 0.6593626537474886, 0.7424265307039932, 0.5035629973234215, 0.5510831160354674, 0.37366654438205593, 0.33299751842582703, 0.3800633883472283, 0.4153963387304084, 0.2752468331316783, 0.8699452196820426, 0.31938530945559984, 0.7990568815957724, 0.6875841747139724, 0.31949965197609675, 0.026911873556428878, 0.2616536698127736, 0.5580118021155783, 0.28345994848845435, 0.1773433165304532, 0.2549417030032525, 0.9777692443616465, 0.6296846343712603, 0.8589339648876033, 0.7020098253707141, 0.8518829567943531, 0.41154622619731374, 0.1075129613308311, 0.8499252434316056, 0.3841876768177086, 0.137415801614582, 0.27030938222499756, 0.5511585560527115, 0.26252884257087217, ... scala> X = null X: Serializable{def getArray(n: Int): Array[Double]; def multiplySum(x: Double,v: Seq[Double]): Double} = null scala> :load Fail.scala Loading Fail.scala... 
X: Serializable{def getArray(n: Int): Array[Double]; def multiplySum(x: Double,v: Seq[Double]): Double} = $anon$1@52c8295b scala> X res0: Serializable{def getArray(n: Int): Array[Double]; def multiplySum(x: Double,v: Seq[Double]): Double} = null {noformat} So, it appears any anonymous class with 2 or more members may exhibit this problem in the presence of {{-Yrepl-class-based}}. Closing this issue. > vars not updated after Scala script reload > -- > > Key: SPARK-14246 > URL: https://issues.apache.org/jira/browse/SPARK-14246 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Jim Powers > Attachments: Fail.scala, Null.scala, reproduce_transient_npe.scala > > > Attached are two scripts. The problem only exhibits itself with Spark 1.6.0, > 1.6.1, and 2.0.0 using Scala 2.11. Scala 2.10 does not exhibit this problem. > With the Regular Scala 2.11(.7) REPL: > {noformat} > scala> :load reproduce_transient_npe.scala > Loading reproduce_transient_npe.scala... > X: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@4149c063 > scala> X > res0: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@4149c063 > scala> val a = X.getArray(10) > warning: there was one feature warning; re-run with -feature for details > a: Array[Double] = Array(0.1701063617079236, 0.17570862034857437, > 0.6065851472098629, 0.4683069994589304, 0.35194859652378363, > 0.04033043823203897, 0.11917887149548367, 0.540367871104426, > 0.18544859881040276, 0.7236380062803334) > scala> X = null > X: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = null > scala> :load reproduce_transient_npe.scala > Loading reproduce_transient_npe.scala... 
> X: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@5860f3d7 > scala> X > res1: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@5860f3d7 > {noformat} > However, from within the Spark shell (Spark 1.6.0, Scala 2.11.7): > {noformat} > scala> :load reproduce_transient_npe.scala > Loading reproduce_transient_npe.scala... > X: Serializable{val cf: Double; def getArray(n: Int): Array[Double]; def > multiplySum(x: Double,v: org.apache.spark.rdd.RDD[Double]): Double} = > $anon$1@750e2d33 > scala> val a = X.getArray(100) > warning: there was one feature warning; re-run with -feature for details > a: Array[Double] = Array(0.6330055191546612, 0.017865502179445936, > 0.6334775064489349, 0.9053929733525056, 0.7648311134918273, > 0.5423177955113584, 0.5164344368587143, 0.420054677669768, > 0.7842112717076851, 0.2098345684721057, 0.7925640951404774, > 0.5604706596425998, 0.8104403239147542, 0.7567005967624031, > 0.5221119883682028,