[jira] [Updated] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-4231:
-----------------------------
    Assignee: Debasish Das

> Add RankingMetrics to examples.MovieLensALS
> -------------------------------------------
>
>                 Key: SPARK-4231
>                 URL: https://issues.apache.org/jira/browse/SPARK-4231
>             Project: Spark
>          Issue Type: Improvement
>          Components: Examples
>    Affects Versions: 1.4.0
>            Reporter: Debasish Das
>            Assignee: Debasish Das
>            Priority: Minor
>             Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> examples.MovieLensALS computes RMSE for movielens dataset but after addition
> of RankingMetrics and enhancements to ALS, it is critical to look at not only
> the RMSE but also measures like prec@k and MAP.
> In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and
> also added a flag that takes an input whether user/product recommendation is
> being validated.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
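For readers unfamiliar with the two ranking measures named in the issue, here is a minimal sketch in plain Scala of precision at k and mean average precision. It is not the MLlib `RankingMetrics` implementation and not the code added to `examples.MovieLensALS`. The object and method names are hypothetical, item ids are assumed to be `Int`, each prediction list is assumed to contain no duplicates, and average precision is normalized by the number of relevant items, which is one common definition.

```scala
object RankingSketch {
  /** Fraction of the first k predicted items that are relevant.
    * If fewer than k items were predicted, the missing ones count as misses. */
  def precisionAt(predicted: Seq[Int], relevant: Set[Int], k: Int): Double = {
    require(k > 0, "k must be positive")
    predicted.take(k).count(relevant.contains).toDouble / k
  }

  /** Average precision for one user: the precision at each rank where a
    * relevant item appears, summed and divided by the number of relevant items. */
  def averagePrecision(predicted: Seq[Int], relevant: Set[Int]): Double =
    if (relevant.isEmpty) 0.0
    else {
      var hits = 0
      var sum = 0.0
      predicted.zipWithIndex.foreach { case (item, idx) =>
        if (relevant.contains(item)) {
          hits += 1
          sum += hits.toDouble / (idx + 1)
        }
      }
      sum / relevant.size
    }

  /** Mean of the per-user average precision over (predicted, relevant) pairs. */
  def meanAveragePrecision(perUser: Seq[(Seq[Int], Set[Int])]): Double =
    if (perUser.isEmpty) 0.0
    else perUser.map { case (p, r) => averagePrecision(p, r) }.sum / perUser.size
}
```

Unlike RMSE, both measures depend only on the order of the recommended items, not on the predicted rating values.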
[jira] [Updated] (SPARK-12143) When column type is binary, select occurs ClassCastExcption in Beeline.
[ https://issues.apache.org/jira/browse/SPARK-12143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-12143:
------------------------------
    Component/s: SQL

[~meiyoula] please set the component and read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Please edit the description too. As written, it doesn't sound like it's even Spark-related.

> When column type is binary, select occurs ClassCastExcption in Beeline.
> -----------------------------------------------------------------------
>
>                 Key: SPARK-12143
>                 URL: https://issues.apache.org/jira/browse/SPARK-12143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: meiyoula
>
> In Beeline, execute the SQL below:
> 1. create table bb(bi binary);
> 2. load data inpath 'tmp/data' into table bb;
> 3. select * from bb;
> Error: java.lang.ClassCastException: java.lang.String cannot be cast to [B
> (state=, code=0)
[jira] [Updated] (SPARK-12156) SPARK_EXECUTOR_INSTANCES is not effective
[ https://issues.apache.org/jira/browse/SPARK-12156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-12156:
------------------------------
    Target Version/s:   (was: 1.6.0)
            Priority: Minor  (was: Major)
       Fix Version/s:     (was: 1.6.0)

[~KaiXinXIaoLei] don't set target/fix version, and set priority appropriately. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> SPARK_EXECUTOR_INSTANCES is not effective
> -----------------------------------------
>
>                 Key: SPARK-12156
>                 URL: https://issues.apache.org/jira/browse/SPARK-12156
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: KaiXinXIaoLei
>            Priority: Minor
>
> I set SPARK_EXECUTOR_INSTANCES=3, but two executors start.
[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-12164:
----------------------------
    Description:
So far, we are using comma-separated decimal format to output the encoded contents. This way is rare when the data is in binary. This could be a common issue when we use Dataset API. For example,

For example,
[
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);
]

  was:
So far, we are using comma-separated decimal format to output the encoded contents. This way is rare when the data is in binary. This could be a common issue when we use Dataset API. For example,

For example,
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);

> [SQL] Display the binary/encoded values
> ---------------------------------------
>
>                 Key: SPARK-12164
>                 URL: https://issues.apache.org/jira/browse/SPARK-12164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>
> So far, we are using comma-separated decimal format to output the encoded
> contents. This way is rare when the data is in binary. This could be a common
> issue when we use Dataset API. For example,
> For example,
> [
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2),
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
> ]
[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-12164:
----------------------------
    Description:
So far, we are using comma-separated decimal format to output the encoded contents. This way is rare when the data is in binary. This could be a common issue when we use Dataset API. For example,

For example,
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);

  was:
So far, we are using decimal format to output the encoded contents. This way is rare when the data is in binary.

For example,
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);

> [SQL] Display the binary/encoded values
> ---------------------------------------
>
>                 Key: SPARK-12164
>                 URL: https://issues.apache.org/jira/browse/SPARK-12164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>
> So far, we are using comma-separated decimal format to output the encoded
> contents. This way is rare when the data is in binary. This could be a common
> issue when we use Dataset API. For example,
> For example,
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2),
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-12164:
----------------------------
    Description:
So far, we are using comma-separated decimal format to output the encoded contents. This way is rare when the data is in binary. This could be a common issue when we use Dataset API. For example,

For example,
{
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);
}

  was:
So far, we are using comma-separated decimal format to output the encoded contents. This way is rare when the data is in binary. This could be a common issue when we use Dataset API. For example,

For example,
[
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);
]

> [SQL] Display the binary/encoded values
> ---------------------------------------
>
>                 Key: SPARK-12164
>                 URL: https://issues.apache.org/jira/browse/SPARK-12164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>
> So far, we are using comma-separated decimal format to output the encoded
> contents. This way is rare when the data is in binary. This could be a common
> issue when we use Dataset API. For example,
> For example,
> {
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2),
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
> }
[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-12164:
----------------------------
    Description:
So far, we are using comma-separated decimal format to output the encoded contents. This way is rare when the data is in binary. This could be a common issue when we use Dataset API.

For example,
{code}
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);
{code}

  was:
So far, we are using comma-separated decimal format to output the encoded contents. This way is rare when the data is in binary. This could be a common issue when we use Dataset API. For example,

For example,
{code}
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);
{code}

> [SQL] Display the binary/encoded values
> ---------------------------------------
>
>                 Key: SPARK-12164
>                 URL: https://issues.apache.org/jira/browse/SPARK-12164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>
> So far, we are using comma-separated decimal format to output the encoded
> contents. This way is rare when the data is in binary. This could be a common
> issue when we use Dataset API.
> For example,
> {code}
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2),
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
> {code}
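To make the display problem in this issue concrete, the sketch below renders the same byte array both ways: as comma-separated signed decimals (the style the issue describes) and as space-separated two-digit hex. The hex layout is only an assumption about what a more conventional binary display could look like; it is not taken from the issue or from the linked pull request, and the object and method names are hypothetical.

```scala
object BinaryDisplaySketch {
  /** Comma-separated signed decimal values, e.g. [1,-1,16]. */
  def asDecimal(bytes: Array[Byte]): String =
    bytes.mkString("[", ",", "]")

  /** Space-separated two-digit upper-case hex, e.g. [01 FF 10].
    * Masking with 0xff maps the signed byte to its unsigned value first. */
  def asHex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString("[", " ", "]")
}
```

With hex, every byte occupies a fixed width and negative values never appear, which is what most binary dumps look like.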
[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-12164:
----------------------------
    Description:
So far, we are using comma-separated decimal format to output the encoded contents. This way is rare when the data is in binary. This could be a common issue when we use Dataset API. For example,

For example,
{code}
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);
{code}

  was:
So far, we are using comma-separated decimal format to output the encoded contents. This way is rare when the data is in binary. This could be a common issue when we use Dataset API. For example,

For example,
{
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false);
}

> [SQL] Display the binary/encoded values
> ---------------------------------------
>
>                 Key: SPARK-12164
>                 URL: https://issues.apache.org/jira/browse/SPARK-12164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>
> So far, we are using comma-separated decimal format to output the encoded
> contents. This way is rare when the data is in binary. This could be a common
> issue when we use Dataset API. For example,
> For example,
> {code}
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2),
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
> {code}
[jira] [Updated] (SPARK-12125) pull out nondeterministic expressions from Join
[ https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-12125:
------------------------------
    Component/s: SQL

> pull out nondeterministic expressions from Join
> -----------------------------------------------
>
>                 Key: SPARK-12125
>                 URL: https://issues.apache.org/jira/browse/SPARK-12125
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: iward
>            Priority: Minor
>
> Currently, *nondeterministic expressions* are only allowed in *Project* or
> *Filter*, and they can only be pulled out when they are used in a *UnaryNode*.
> But in many cases we use nondeterministic expressions to process *join keys*
> in order to avoid data skew. For example:
> {noformat}
> select *
> from tableA a
> join
> (select * from tableB) b
> on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( (-rand() * 1000 ) as string ) else a.brand_code end )) = b.brand_code
> {noformat}
> This PR introduces a mechanism to pull out nondeterministic expressions from
> *Join*, so we can use nondeterministic expressions in *Join* appropriately.
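The idea of "pulling out" the nondeterministic expression can be illustrated outside Spark: evaluate the random key exactly once per row in a projection, then run an ordinary deterministic equi-join on the materialized key. The sketch below uses plain Scala collections, not Catalyst. The case classes, object name, and use of `scala.util.Random` are all hypothetical stand-ins for the tables and the `rand()` call in the query above.

```scala
import scala.util.Random

object SkewJoinSketch {
  final case class RowA(id: Int, brandCode: String)
  final case class RowB(brandCode: String, name: String)

  /** Mirrors the join condition: a null or empty brand code is replaced by a
    * random negative number rendered as a string, so such rows spread over
    * many distinct keys instead of piling up on one. */
  def joinKey(brandCode: String, rng: Random): String = {
    val raw =
      if (brandCode == null || brandCode.isEmpty) (-rng.nextDouble() * 1000).toString
      else brandCode
    raw.toUpperCase
  }

  def join(as: Seq[RowA], bs: Seq[RowB], rng: Random): Seq[(RowA, RowB)] = {
    // Step 1 (the "pulled out" projection): compute the key once per row.
    val projected = as.map(a => (joinKey(a.brandCode, rng), a))
    // Step 2: a plain equi-join on the already-computed key.
    val bByKey = bs.groupBy(_.brandCode)
    for {
      (key, a) <- projected
      b <- bByKey.getOrElse(key, Seq.empty)
    } yield (a, b)
  }
}
```

Because the key is materialized before the join, re-evaluating the join condition can never see a different random value for the same row.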
[jira] [Updated] (SPARK-12060) Avoid memory copy in JavaSerializerInstance.serialize
[ https://issues.apache.org/jira/browse/SPARK-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-12060:
------------------------------
    Fix Version/s:     (was: 1.6.0)

> Avoid memory copy in JavaSerializerInstance.serialize
> -----------------------------------------------------
>
>                 Key: SPARK-12060
>                 URL: https://issues.apache.org/jira/browse/SPARK-12060
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>            Priority: Critical
>
> JavaSerializerInstance.serialize uses ByteArrayOutputStream.toByteArray to
> get the serialized data. ByteArrayOutputStream.toByteArray needs to copy the
> content in the internal array to a new array. However, since the array will
> be converted to ByteBuffer at once, we can avoid the memory copy.
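One way to avoid the copy described above is to subclass `ByteArrayOutputStream` and wrap its protected internal array directly. This is a sketch under that assumption, not a quote of the actual patch. The class name here is illustrative, while `buf`, `count`, and `ByteBuffer.wrap(array, offset, length)` are standard JDK members.

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

/** Output stream that can hand out its contents as a ByteBuffer view.
  * `buf` and `count` are protected fields of ByteArrayOutputStream. */
class ByteBufferOutputStream(capacity: Int = 32) extends ByteArrayOutputStream(capacity) {
  /** Wraps the internal array without copying it. The buffer shares storage
    * with the stream, so the stream should not be written to afterwards. */
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}
```

By contrast, `toByteArray` allocates a new array of `count` bytes and copies into it on every call, which is the cost the issue aims to remove.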
[jira] [Assigned] (SPARK-12159) Add user guide section for IndexToString transformer
[ https://issues.apache.org/jira/browse/SPARK-12159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12159:
------------------------------------
    Assignee: Apache Spark

> Add user guide section for IndexToString transformer
> ----------------------------------------------------
>
>                 Key: SPARK-12159
>                 URL: https://issues.apache.org/jira/browse/SPARK-12159
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Apache Spark
>            Priority: Minor
>
> Add a user guide section for the IndexToString transformer as reported on the
> mailing list ( https://www.mail-archive.com/dev@spark.apache.org/msg12263.html )
[jira] [Assigned] (SPARK-12159) Add user guide section for IndexToString transformer
[ https://issues.apache.org/jira/browse/SPARK-12159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12159:
------------------------------------
    Assignee:     (was: Apache Spark)

> Add user guide section for IndexToString transformer
> ----------------------------------------------------
>
>                 Key: SPARK-12159
>                 URL: https://issues.apache.org/jira/browse/SPARK-12159
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Add a user guide section for the IndexToString transformer as reported on the
> mailing list ( https://www.mail-archive.com/dev@spark.apache.org/msg12263.html )
[jira] [Commented] (SPARK-12159) Add user guide section for IndexToString transformer
[ https://issues.apache.org/jira/browse/SPARK-12159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043995#comment-15043995 ]

Apache Spark commented on SPARK-12159:
--------------------------------------

User 'BenFradet' has created a pull request for this issue:
https://github.com/apache/spark/pull/10166

> Add user guide section for IndexToString transformer
> ----------------------------------------------------
>
>                 Key: SPARK-12159
>                 URL: https://issues.apache.org/jira/browse/SPARK-12159
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Add a user guide section for the IndexToString transformer as reported on the
> mailing list ( https://www.mail-archive.com/dev@spark.apache.org/msg12263.html )
[jira] [Assigned] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12164:
------------------------------------
    Assignee:     (was: Apache Spark)

> [SQL] Display the binary/encoded values
> ---------------------------------------
>
>                 Key: SPARK-12164
>                 URL: https://issues.apache.org/jira/browse/SPARK-12164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>
> So far, we are using decimal format to output the encoded contents. This way
> is rare when the data is in binary.
> For example,
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2),
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
[jira] [Commented] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043983#comment-15043983 ]

Apache Spark commented on SPARK-12164:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10165

> [SQL] Display the binary/encoded values
> ---------------------------------------
>
>                 Key: SPARK-12164
>                 URL: https://issues.apache.org/jira/browse/SPARK-12164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>
> So far, we are using decimal format to output the encoded contents. This way
> is rare when the data is in binary.
> For example,
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2),
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
[jira] [Assigned] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12164:
------------------------------------
    Assignee: Apache Spark

> [SQL] Display the binary/encoded values
> ---------------------------------------
>
>                 Key: SPARK-12164
>                 URL: https://issues.apache.org/jira/browse/SPARK-12164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>            Assignee: Apache Spark
>
> So far, we are using decimal format to output the encoded contents. This way
> is rare when the data is in binary.
> For example,
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2),
> KryoClassData("c", 3)).toDS()
> ds.show(20, false);
[jira] [Updated] (SPARK-12153) Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible
[ https://issues.apache.org/jira/browse/SPARK-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-12153:
------------------------------
      Labels:   (was: patch)
    Priority: Minor  (was: Major)

(I don't think this can be considered major)

> Word2Vec uses a fixed length for sentences which is not reasonable for reality, and similarity functions and fields are not accessible
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12153
>                 URL: https://issues.apache.org/jira/browse/SPARK-12153
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.2
>            Reporter: YongGang Cao
>            Priority: Minor
>
> Sentence boundaries matter for the sliding window: we shouldn't train the
> model on a window that spans sentences. The current hard split of sentences
> at 100 words doesn't really make sense.
> Also, the cosine similarity functions are private, which makes them useless
> to callers.
> We may need to access the vocabulary and word index table as well; those need
> getters.
> I made changes to address the above issues.
> Here is the pull request: https://github.com/apache/spark/pull/10152
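The point about sentence boundaries can be shown with a small sketch that generates (center word, context word) training pairs from a sliding window that never crosses from one sentence into the next. This is plain Scala for illustration only and is not the MLlib Word2Vec code or the change in the linked pull request. The object name, method name, and `window` parameter are hypothetical.

```scala
object WindowSketch {
  /** For every word, pairs it with each neighbour at most `window` positions
    * away within the same sentence. Sentences are processed independently,
    * so no pair ever spans a sentence boundary. */
  def contextPairs(sentences: Seq[Seq[String]], window: Int): Seq[(String, String)] =
    for {
      s <- sentences
      i <- s.indices
      j <- math.max(0, i - window) to math.min(s.length - 1, i + window)
      if j != i
    } yield (s(i), s(j))
}
```

If the same text were instead cut into fixed-length chunks regardless of punctuation, the last word of one sentence and the first word of the next would be paired whenever they fell into the same chunk.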
[jira] [Commented] (SPARK-12162) Embedded Spark on JBoss server cause crashing due system.exit when SparkUncaughtExceptionHandler called
[ https://issues.apache.org/jira/browse/SPARK-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043782#comment-15043782 ]

Sasi commented on SPARK-12162:
------------------------------

Hey,
Thanks for the quick response.
On my JBoss I'm running only new SparkContext(sparkConf) and SQLContext(sparkContext);
I have another machine that runs my workers, and the master runs on that same machine.
Is that the right way, or am I missing something?
Thanks a lot!
Sasi

> Embedded Spark on JBoss server cause crashing due system.exit when SparkUncaughtExceptionHandler called
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12162
>                 URL: https://issues.apache.org/jira/browse/SPARK-12162
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Sasi
>            Priority: Critical
>
> Hello,
> I'm running Spark on JBoss and sometimes I get the following exception:
> {code}
> ERROR : (org.apache.spark.util.SparkUncaughtExceptionHandler:96) -[appclient-registration-retry-thread] Uncaught exception in thread Thread[appclient-registration-retry-thread,5,jboss]
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@4e33f83e rejected from java.util.concurrent.ThreadPoolExecutor@35eed68e[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 3]
> {code}
> Then my JBoss crashed, so I took a look at the source of
> SparkUncaughtExceptionHandler and noticed that when the exception is handled
> it calls System.exit(SparkExitCode.UNCAUGHT_EXCEPTION).
> [https://github.com/apache/spark/blob/3bd77b213a9cd177c3ea3c61d37e5098e55f75a5/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala]
> Since System.exit(...) is called, my JBoss crashes.
> Is there any workaround/fix that can help me?
> Thanks,
> Sasi
[jira] [Updated] (SPARK-9603) Re-enable complex R package test in SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-9603:
-----------------------------
    Fix Version/s:     (was: 1.6.0)

> Re-enable complex R package test in SparkSubmitSuite
> ----------------------------------------------------
>
>                 Key: SPARK-9603
>                 URL: https://issues.apache.org/jira/browse/SPARK-9603
>             Project: Spark
>          Issue Type: Test
>          Components: Deploy, SparkR, Tests
>    Affects Versions: 1.5.0
>            Reporter: Burak Yavuz
>            Assignee: Sun Rui
>
> For building complex Spark Packages that contain R code in addition to Scala,
> we have a complex procedure, where R source code is shipped inside a jar. The
> source code is extracted, built, and is added as a library among SparkR.
> The end to end test in SparkSubmitSuite ("correctly builds R packages
> included in a jar with --packages") can't run on Jenkins now, because the
> pull request builder is not built with SparkR. Once the PR Builder is built
> with SparkR, we should re-enable the test.
[jira] [Updated] (SPARK-12163) FPGrowth unusable on some datasets without extensive tweaking of the support threshold
[ https://issues.apache.org/jira/browse/SPARK-12163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-12163:
------------------------------
      Priority: Minor  (was: Major)
    Issue Type: Improvement  (was: Bug)

> FPGrowth unusable on some datasets without extensive tweaking of the support threshold
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-12163
>                 URL: https://issues.apache.org/jira/browse/SPARK-12163
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Jaroslav Kuchar
>            Priority: Minor
>
> This problem occurs on standard machine learning UCI datasets.
> Details for the "audiology" dataset follow: it contains only 226 transactions
> and 70 attributes. Mining frequent itemsets with a support threshold of 0.95
> produces 73,162,705 itemsets; with a support threshold of 0.94, it produces
> 366,880,771 itemsets.
> More details about the experiment:
> https://gist.github.com/jaroslav-kuchar/edbcbe72c5a884136db1
> The number of generated itemsets grows rapidly with the number of unique
> items in transactions. Because of this combinatorial explosion, some settings
> of the support threshold lead to CPU-intensive, long-running tasks. The
> extensive tweaking of the support threshold that this requires makes the
> FPGrowth implementation unusable even for a small dataset.
> It would be useful to implement additional stopping criteria to control the
> explosion of the itemset count in FPGrowth. We propose an optional limit on
> the maximum number of generated itemsets or on the maximum number of items
> per itemset.
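The combinatorial explosion mentioned in the issue, and the effect of the proposed cap on items per itemset, can be sketched with simple arithmetic: over n distinct items there are at most C(n,1) + C(n,2) + ... + C(n,m) itemsets of size at most m. The code below computes that upper bound in plain Scala. It is not FPGrowth and does not reproduce the issue's measurements. The object and method names are hypothetical, and the item counts in the example are illustrative, since the issue does not state how many distinct items the dataset has.

```scala
object ItemsetBoundSketch {
  /** Binomial coefficient C(n, k), computed exactly with BigInt. */
  def binomial(n: Int, k: Int): BigInt = {
    require(n >= 0 && k >= 0, "n and k must be non-negative")
    if (k > n) BigInt(0)
    // After step i the accumulator holds C(n - k + i, i), so each division is exact.
    else (1 to k).foldLeft(BigInt(1)) { (acc, i) => acc * (n - k + i) / i }
  }

  /** Upper bound on the number of non-empty itemsets over `numItems` distinct
    * items when each itemset may hold at most `maxLen` items. */
  def maxItemsets(numItems: Int, maxLen: Int): BigInt =
    (1 to math.min(maxLen, numItems)).map(binomial(numItems, _)).sum
}
```

For example, `ItemsetBoundSketch.maxItemsets(70, 3)` is 57225, while `ItemsetBoundSketch.maxItemsets(70, 70)` is 2^70 - 1, so a length cap bounds the output size regardless of how low the support threshold is set.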
[jira] [Updated] (SPARK-12136) rddToFileName does not properly handle prefix and suffix parameters
[ https://issues.apache.org/jira/browse/SPARK-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-12136:
------------------------------
    Labels: starter  (was: )

[~naveenminchu] Just read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and make a pull request then. Should be trivial to separately handle the prefix and suffix.

> rddToFileName does not properly handle prefix and suffix parameters
> -------------------------------------------------------------------
>
>                 Key: SPARK-12136
>                 URL: https://issues.apache.org/jira/browse/SPARK-12136
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>            Reporter: Brian Webb
>            Priority: Minor
>              Labels: starter
>
> See code here:
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala#L894
>   private[streaming] def rddToFileName[T](prefix: String, suffix: String, time: Time): String = {
>     if (prefix == null) {
>       time.milliseconds.toString
>     } else if (suffix == null || suffix.length == 0) {
>       prefix + "-" + time.milliseconds
>     } else {
>       prefix + "-" + time.milliseconds + "." + suffix
>     }
>   }
> This code does not seem to properly handle the case where the prefix is
> null but the suffix is not null: the suffix should be used but is not.
> Also, the check for length == 0 is only applied to the suffix, not the
> prefix. It seems the check should be consistent between the two.
> Is there a reason not to address these two issues and change the code?
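A minimal sketch of handling the prefix and suffix separately, along the lines suggested in the comment above, could look as follows. It is not the fix that was merged. To stay self-contained it takes the timestamp as a plain `Long` of milliseconds instead of Spark's `Time`, and the enclosing object name is hypothetical.

```scala
object FileNameSketch {
  /** Builds "[prefix-]millis[.suffix]". Null and empty strings are treated
    * alike for both parameters, and each is handled independently. */
  def rddToFileName(prefix: String, suffix: String, timeMs: Long): String = {
    val base =
      if (prefix == null || prefix.isEmpty) timeMs.toString
      else s"$prefix-$timeMs"
    if (suffix == null || suffix.isEmpty) base
    else s"$base.$suffix"
  }
}
```

With this shape, a null prefix combined with a suffix of `"txt"` yields `"<millis>.txt"`, which is the case the reporter says is currently dropped.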
[jira] [Resolved] (SPARK-12162) Embedded Spark on JBoss server cause crashing due system.exit when SparkUncaughtExceptionHandler called
[ https://issues.apache.org/jira/browse/SPARK-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-12162.
-------------------------------
    Resolution: Not A Problem

I think the answer is that you can't in general "embed" the Spark executor processes.

> Embedded Spark on JBoss server cause crashing due system.exit when SparkUncaughtExceptionHandler called
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12162
>                 URL: https://issues.apache.org/jira/browse/SPARK-12162
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Sasi
>            Priority: Critical
>
> Hello,
> I'm running Spark on JBoss and sometimes I get the following exception:
> {code}
> ERROR : (org.apache.spark.util.SparkUncaughtExceptionHandler:96) -[appclient-registration-retry-thread] Uncaught exception in thread Thread[appclient-registration-retry-thread,5,jboss]
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@4e33f83e rejected from java.util.concurrent.ThreadPoolExecutor@35eed68e[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 3]
> {code}
> Then my JBoss crashed, so I took a look at the source of
> SparkUncaughtExceptionHandler and noticed that when the exception is handled
> it calls System.exit(SparkExitCode.UNCAUGHT_EXCEPTION).
> [https://github.com/apache/spark/blob/3bd77b213a9cd177c3ea3c61d37e5098e55f75a5/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala]
> Since System.exit(...) is called, my JBoss crashes.
> Is there any workaround/fix that can help me?
> Thanks,
> Sasi
[jira] [Created] (SPARK-12163) FPGrowth unusable on some datasets without extensive tweaking of the support threshold
Jaroslav Kuchar created SPARK-12163:
---------------------------------------

             Summary: FPGrowth unusable on some datasets without extensive tweaking of the support threshold
                 Key: SPARK-12163
                 URL: https://issues.apache.org/jira/browse/SPARK-12163
             Project: Spark
          Issue Type: Bug
          Components: MLlib
            Reporter: Jaroslav Kuchar


This problem occurs on standard machine learning UCI datasets.
Details for the "audiology" dataset follow: it contains only 226 transactions and 70 attributes. Mining frequent itemsets with a support threshold of 0.95 produces 73,162,705 itemsets; with a support threshold of 0.94, it produces 366,880,771 itemsets.
More details about the experiment: https://gist.github.com/jaroslav-kuchar/edbcbe72c5a884136db1

The number of generated itemsets grows rapidly with the number of unique items in transactions. Because of this combinatorial explosion, some settings of the support threshold lead to CPU-intensive, long-running tasks. The extensive tweaking of the support threshold that this requires makes the FPGrowth implementation unusable even for a small dataset.

It would be useful to implement additional stopping criteria to control the explosion of the itemset count in FPGrowth. We propose an optional limit on the maximum number of generated itemsets or on the maximum number of items per itemset.
[jira] [Updated] (SPARK-12125) pull out nondeterministic expressions from Join
[ https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-12125:
--
Affects Version/s: (was: 1.5.1) (was: 1.5.0)
Target Version/s: (was: 1.6.0)
Priority: Minor (was: Major)
Fix Version/s: (was: 1.6.0)

[~iward] don't set target/fix version. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> pull out nondeterministic expressions from Join
> ---
>
> Key: SPARK-12125
> URL: https://issues.apache.org/jira/browse/SPARK-12125
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.5.2
> Reporter: iward
> Priority: Minor
>
> Currently, *nondeterministic expressions* are only allowed in *Project* or *Filter*, and they can only be pulled out when they are used in a *UnaryNode*.
> However, in many cases we use nondeterministic expressions to process *join keys* in order to avoid data skew. For example:
> {noformat}
> select *
> from tableA a
> join
> (select * from tableB) b
> on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( (-rand() * 1000 ) as string ) else a.brand_code end )) = b.brand_code
> {noformat}
> This PR introduces a mechanism to pull nondeterministic expressions out of *Join*, so that we can use nondeterministic expressions in *Join* appropriately.
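Why a nondeterministic join key helps with skew can be shown with a rough sketch outside Spark. The Python below mirrors the CASE expression from the query in SPARK-12125: null or empty `brand_code` values are replaced by a random negative number rendered as a string. The helper names and the use of `zlib.crc32` as a stand-in for a hash partitioner are assumptions made for illustration only.

```python
import random
import zlib


def join_key(brand_code, rng):
    """Mirror of the CASE expression: salt null/empty keys, then upper-case."""
    raw = str(-rng.random() * 1000) if not brand_code else brand_code
    return raw.upper()


def partition_of(key, num_partitions):
    # crc32 is only a stand-in for the engine's hash partitioner (an assumption).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partitions_used(keys, num_partitions):
    """Set of partitions that receive at least one of the given keys."""
    return {partition_of(k, num_partitions) for k in keys}
```

If 1000 rows share the empty key, `partitions_used` reports a single partition for them, while the salted keys from `join_key` spread over several partitions. The salted rows can never match `b.brand_code`, which is the point: they no longer pile up on one task.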
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044223#comment-15044223 ]

hujiayin commented on SPARK-4036:
-
Hi Andrew,
The code is implemented in Scala and integrated with Spark. I tested it after I implemented it. I also verified it against some papers listed inside the code and in this JIRA. Could you send me your features and models so that I can do further testing?

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
>
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Guoqiang Li
> Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction.
> The paper:
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf
[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044223#comment-15044223 ]

hujiayin edited comment on SPARK-4036 at 12/7/15 1:00 AM:
--
Hi Andrew,
The code is implemented in Scala and integrated with Spark. I tested it after I implemented it. I also verified it against some papers listed inside the code and in this JIRA. Could you send me your features and models so that I can do further testing?

was (Author: hujiayin):
Hi Andrew, The code is implemented by Scala and integrated with Spark. I tested it after I implemented it. I also verify it with some papers listed inside the code and this jira. Could you send me your features and models that I can have further testing?

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
>
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Guoqiang Li
> Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction.
> The paper:
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf
[jira] [Assigned] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.
[ https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-12155: -- Assignee: Josh Rosen > Execution OOM after a relative large dataset cached in the cluster. > --- > > Key: SPARK-12155 > URL: https://issues.apache.org/jira/browse/SPARK-12155 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > > I have a cluster with relative 80GB of mem. Then, I cached a 43GB dataframe. > When I start to consume the query. I got the following exception (I added > more logs to the code). > {code} > 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for > 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize. > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for > block rdd_94_37(free: 3253659951, max: 16798973952) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for > block rdd_94_37(free: 3252611375, max: 16798973952) > 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). > 3028 bytes result sent to driver > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for > block rdd_94_37(free: 3314840375, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for > block rdd_94_37(free: 3215892137, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space > for block rdd_94_37(free: 3117216424, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space > for block rdd_94_37(free: 2919868859, max: 16866344960) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space > for block rdd_94_37(free: 2687050010, max: 16929521664) > 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space > for block rdd_94_37(free: 2292321531, max: 16929521664) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space > for block rdd_94_37(free: 1701062715, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space > for block rdd_94_37(free: 799417533, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would > require dropping another block from the same RDD > 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in > memory! (computed 2.4 GB so far) > 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB > (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB. > 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. > 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory > from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. 
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of > memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). > 3077 bytes result sent to driver > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120 > 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120) > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124 > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128 > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132 > 15/12/05 01:20:56 INFO Executor: Running task 9.0 in stage 5.0
[jira] [Resolved] (SPARK-12152) Speed up Scalastyle by only running one SBT command instead of four
[ https://issues.apache.org/jira/browse/SPARK-12152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-12152. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10151 [https://github.com/apache/spark/pull/10151] > Speed up Scalastyle by only running one SBT command instead of four > --- > > Key: SPARK-12152 > URL: https://issues.apache.org/jira/browse/SPARK-12152 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > dev/scalastyle runs four SBT commands when only one would suffice. We should > fix this in order to speed up pull request builds by about 60 seconds.
[jira] [Created] (SPARK-12166) Unset hadoop related environment in testing
Jeff Zhang created SPARK-12166:
--
Summary: Unset hadoop related environment in testing
Key: SPARK-12166
URL: https://issues.apache.org/jira/browse/SPARK-12166
Project: Spark
Issue Type: Improvement
Components: Tests
Affects Versions: 1.5.2
Reporter: Jeff Zhang

I tried to run HiveSparkSubmitSuite on my local box, but it fails. The cause is that Spark still uses my local single-node Hadoop cluster when running the unit test. I don't think it makes sense to do that. The Hadoop-related environment variables should be unset before testing. I suspect dev/run-tests doesn't do that either.
Here's the error message:
{code}
Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxr-xr-x
[info] at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
[info] at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
[info] at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
[info] at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
{code}
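One possible way to keep a local Hadoop installation from leaking into a test run is to launch the tests with a scrubbed environment. The sketch below is only an illustration of that idea, not what dev/run-tests does: the report in SPARK-12166 does not name the variables, so the prefix list is an assumption, and `scrubbed_env` is a hypothetical helper.

```python
import os

# Assumed prefixes; the issue does not say which variables were involved.
HADOOP_PREFIXES = ("HADOOP_", "YARN_", "HIVE_")


def scrubbed_env(env=None, prefixes=HADOOP_PREFIXES):
    """Return a copy of the environment without Hadoop-related variables."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if not k.startswith(prefixes)}
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run`, so the child process that runs the test suite never sees variables such as `HADOOP_CONF_DIR`, while the parent shell keeps them.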
[jira] [Commented] (SPARK-12125) pull out nondeterministic expressions from Join
[ https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044272#comment-15044272 ]

iward commented on SPARK-12125:
---
OK, got it. Thanks a lot.

> pull out nondeterministic expressions from Join
> ---
>
> Key: SPARK-12125
> URL: https://issues.apache.org/jira/browse/SPARK-12125
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2
> Reporter: iward
> Priority: Minor
>
> Currently, *nondeterministic expressions* are only allowed in *Project* or *Filter*, and they can only be pulled out when they are used in a *UnaryNode*.
> However, in many cases we use nondeterministic expressions to process *join keys* in order to avoid data skew. For example:
> {noformat}
> select *
> from tableA a
> join
> (select * from tableB) b
> on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( (-rand() * 1000 ) as string ) else a.brand_code end )) = b.brand_code
> {noformat}
> This PR introduces a mechanism to pull nondeterministic expressions out of *Join*, so that we can use nondeterministic expressions in *Join* appropriately.
[jira] [Commented] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.
[ https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044237#comment-15044237 ] Josh Rosen commented on SPARK-12155: I'm working on fixing this issue. I have a regression test for this bug at https://github.com/apache/spark/commit/4c8110ddeee990507c9347700dec557fc22a55a5. While investigating this, I found a closely-related bug which impacts eviction of storage memory in cases where you have only a single running task on an executor (this bug, SPARK-12155, is triggered by having multiple running tasks on an executor). I'm going to break down the task of fixing this bug into a series of smaller patches in order to lessen the review burden. > Execution OOM after a relative large dataset cached in the cluster. > --- > > Key: SPARK-12155 > URL: https://issues.apache.org/jira/browse/SPARK-12155 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > > I have a cluster with relative 80GB of mem. Then, I cached a 43GB dataframe. > When I start to consume the query. I got the following exception (I added > more logs to the code). > {code} > 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for > 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize. > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for > block rdd_94_37(free: 3253659951, max: 16798973952) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for > block rdd_94_37(free: 3252611375, max: 16798973952) > 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for > block rdd_94_37(free: 3314840375, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for > block rdd_94_37(free: 3215892137, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space > for block rdd_94_37(free: 3117216424, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space > for block rdd_94_37(free: 2919868859, max: 16866344960) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space > for block rdd_94_37(free: 2687050010, max: 16929521664) > 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). > 3028 bytes result sent to driver > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space > for block rdd_94_37(free: 2292321531, max: 16929521664) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space > for block rdd_94_37(free: 1701062715, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space > for block rdd_94_37(free: 799417533, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would > require dropping another block from the same RDD > 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in > memory! (computed 2.4 GB so far) > 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB > (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB. > 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. 
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. > 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory > from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. > 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of > memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05
[jira] [Commented] (SPARK-12060) Avoid memory copy in JavaSerializerInstance.serialize
[ https://issues.apache.org/jira/browse/SPARK-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044304#comment-15044304 ] Apache Spark commented on SPARK-12060: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/10167 > Avoid memory copy in JavaSerializerInstance.serialize > - > > Key: SPARK-12060 > URL: https://issues.apache.org/jira/browse/SPARK-12060 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Critical > > JavaSerializerInstance.serialize uses ByteArrayOutputStream.toByteArray to > get the serialized data. ByteArrayOutputStream.toByteArray needs to copy the > content in the internal array to a new array. However, since the array will > be converted to ByteBuffer at once, we can avoid the memory copy.
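The difference between copying a buffer and wrapping it, which SPARK-12060 describes for `ByteArrayOutputStream.toByteArray`, can be sketched with a Python analogue. This is not the Spark change itself: a `bytearray` plays the role of the stream's internal array, `bytes(...)` plays the copying call, and a `memoryview` slice plays the role of wrapping the array in a buffer. The function names are hypothetical.

```python
def snapshot_with_copy(buf, length):
    """Copy the first `length` bytes into a new, independent bytes object."""
    return bytes(memoryview(buf)[:length])


def view_without_copy(buf, length):
    """Expose the first `length` bytes of `buf` without copying them."""
    return memoryview(buf)[:length]


# Usage: a later write to the underlying array is visible through the view only.
internal = bytearray(b"spark-data\x00\x00")
copied = snapshot_with_copy(internal, 10)
view = view_without_copy(internal, 10)
internal[0] = ord("S")
```

After the last line, `copied` still starts with a lowercase `s`, while `view` reflects the write because it shares memory with `internal`. That sharing is what saves the copy, and it is also why a no-copy view is only safe when the producer stops writing to the array afterwards.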
[jira] [Resolved] (SPARK-12138) Escape \u in the generated comments.
[ https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12138. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 10155 [https://github.com/apache/spark/pull/10155] > Escape \u in the generated comments. > > > Key: SPARK-12138 > URL: https://issues.apache.org/jira/browse/SPARK-12138 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > Fix For: 1.6.0 > > > https://spark-tests.appspot.com/test-logs/12683942
[jira] [Updated] (SPARK-12138) Escape \u in the generated comments.
[ https://issues.apache.org/jira/browse/SPARK-12138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12138: - Assignee: Xiao Li > Escape \u in the generated comments. > > > Key: SPARK-12138 > URL: https://issues.apache.org/jira/browse/SPARK-12138 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li > Fix For: 1.6.0 > > > https://spark-tests.appspot.com/test-logs/12683942
[jira] [Commented] (SPARK-12040) Add toJson/fromJson to Vector/Vectors for PySpark
[ https://issues.apache.org/jira/browse/SPARK-12040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044145#comment-15044145 ]

holdenk commented on SPARK-12040:
-
So this came out of wanting a matching API between Scala and Python, since the toJson/fromJson methods are public. We could use the wrappers for any of the models that are implemented in Scala, but if we implement any models/transformers in Python, copying the vector to and from the JVM just to write JSON would be a waste.

> Add toJson/fromJson to Vector/Vectors for PySpark
> -
>
> Key: SPARK-12040
> URL: https://issues.apache.org/jira/browse/SPARK-12040
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib, PySpark
> Reporter: Yanbo Liang
> Priority: Trivial
> Labels: starter
>
> Add toJson/fromJson to Vector/Vectors for PySpark; please refer to the Scala one, SPARK-11766.
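A pure-Python round trip of the kind discussed in SPARK-12040 might look like the sketch below, which never touches the JVM. It works on plain lists rather than PySpark's vector classes, the function names are hypothetical, and the JSON layout (a numeric `type` field, 1 for dense and 0 for sparse) is an assumption about what the Scala side emits, not something stated in this thread.

```python
import json


def dense_to_json(values):
    """Serialize a dense vector given as a sequence of numbers."""
    return json.dumps({"type": 1, "values": [float(v) for v in values]})


def sparse_to_json(size, indices, values):
    """Serialize a sparse vector given as size, indices and non-zero values."""
    return json.dumps({
        "type": 0,
        "size": int(size),
        "indices": [int(i) for i in indices],
        "values": [float(v) for v in values],
    })


def vector_from_json(text):
    """Parse either layout back into a tagged tuple."""
    obj = json.loads(text)
    if obj["type"] == 1:
        return ("dense", obj["values"])
    if obj["type"] == 0:
        return ("sparse", obj["size"], obj["indices"], obj["values"])
    raise ValueError("unknown vector type: %r" % (obj["type"],))
```

Keeping the serialization on the Python side means a model or transformer written in Python can persist its vectors with one `json.dumps` call instead of shipping them to the JVM and back.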
[jira] [Created] (SPARK-12169) SparkR 2.0
Sun Rui created SPARK-12169: --- Summary: SparkR 2.0 Key: SPARK-12169 URL: https://issues.apache.org/jira/browse/SPARK-12169 Project: Spark Issue Type: Bug Components: SparkR Reporter: Sun Rui This is an umbrella issue addressing all SparkR related issues corresponding to Spark 2.0 being planned.
[jira] [Updated] (SPARK-12158) [R] [SQL] Fix 'sample' functions that break R unit test cases
[ https://issues.apache.org/jira/browse/SPARK-12158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-12158:
Component/s: (was: R) SparkR

> [R] [SQL] Fix 'sample' functions that break R unit test cases
> -
>
> Key: SPARK-12158
> URL: https://issues.apache.org/jira/browse/SPARK-12158
> Project: Spark
> Issue Type: Bug
> Components: SparkR, SQL
> Affects Versions: 1.6.0
> Reporter: Xiao Li
> Priority: Critical
>
> The existing sample functions are missing the 'seed' parameter, whereas the corresponding function interface in `generics` has it.
> This can cause SparkR unit tests to fail. For example, I hit it in one PR:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull
[jira] [Commented] (SPARK-12169) SparkR 2.0
[ https://issues.apache.org/jira/browse/SPARK-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044418#comment-15044418 ] Shivaram Venkataraman commented on SPARK-12169: --- Thanks [~sunrui] for starting this. You can assign all the issues to me if you want to avoid people picking these up while we discuss them. > SparkR 2.0 > -- > > Key: SPARK-12169 > URL: https://issues.apache.org/jira/browse/SPARK-12169 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Sun Rui > > This is an umbrella issue addressing all SparkR related issues corresponding > to Spark 2.0 being planned.
[jira] [Created] (SPARK-12167) Invoke the right sameResult function when plan is warpped with SubQueries
Yadong Qi created SPARK-12167:
-
Summary: Invoke the right sameResult function when plan is warpped with SubQueries
Key: SPARK-12167
URL: https://issues.apache.org/jira/browse/SPARK-12167
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.2
Reporter: Yadong Qi

I found this bug when using `cache table`:
```
spark-sql> create table src_p(key int, value int) stored as parquet;
OK
Time taken: 3.144 seconds
spark-sql> cache table src_p;
Time taken: 1.452 seconds
spark-sql> explain extended select count(*) from src_p;
```
I got the wrong physical plan:
```
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#28L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#33L])
   Scan ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][]
```
The right physical plan is:
```
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#47L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#62L])
   InMemoryColumnarTableScan (InMemoryRelation [key#45,value#46], true, 1, StorageLevel(true, true, false, true, 1), (Scan ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][key#9,value#10]), Some(src_p))
```
When the implementation classes of `MultiInstanceRelation` (e.g. `LogicalRelation`, `LocalRelation`) are wrapped in SubQueries, their own `sameResult` implementations are not invoked. So we need to eliminate the SubQueries first and then invoke each class's own `sameResult` implementation.
For example, when the plan is `Subquery(LogicalRelation(relation:ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p], expectedOutputAttributes:Some(ArrayBuffer(key#0, value#1`, we first eliminate the SubQueries and then invoke the `sameResult` function in `LogicalRelation` instead of the one in `LogicalPlan`.
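The dispatch problem described in SPARK-12167 can be modelled with a toy plan tree. The Python classes below are not Catalyst classes; `Relation`, `Subquery`, `eliminate_subqueries` and `same_result` are hypothetical stand-ins that only mimic the behaviour described above: a wrapper node falls back to a generic comparison that is sensitive to expression IDs, while the relation's own check ignores them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    path: str
    output_ids: tuple  # expression IDs differ between instances of one relation

    def same_result(self, other):
        # Relation-specific check: same underlying data, IDs ignored.
        return isinstance(other, Relation) and self.path == other.path


@dataclass(frozen=True)
class Subquery:
    alias: str
    child: object

    def same_result(self, other):
        # Generic check: structural equality, which also compares expression IDs.
        return self == other


def eliminate_subqueries(plan):
    """Strip any chain of Subquery wrappers and return the wrapped plan."""
    while isinstance(plan, Subquery):
        plan = plan.child
    return plan


def same_result(a, b):
    """Unwrap both plans first so the relation's own check is the one invoked."""
    return eliminate_subqueries(a).same_result(eliminate_subqueries(b))
```

For a cached plan `Subquery("src_p", Relation(path, (9, 10)))` and a query plan `Subquery("src_p", Relation(path, (45, 46)))`, the wrapper's generic check reports a mismatch, whereas `same_result` after unwrapping reports a match, which is what lets the cached data be found.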
[jira] [Assigned] (SPARK-12165) Execution memory requests may fail to evict storage blocks if storage memory usage is below max memory
[ https://issues.apache.org/jira/browse/SPARK-12165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12165: Assignee: Apache Spark (was: Josh Rosen) > Execution memory requests may fail to evict storage blocks if storage memory > usage is below max memory > -- > > Key: SPARK-12165 > URL: https://issues.apache.org/jira/browse/SPARK-12165 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Blocker > > Consider a scenario where storage memory usage has grown past the size of the > unevictable storage region ({{spark.memory.storageFraction}} * maxMemory) and > a task needs to acquire more execution memory by reclaiming evictable storage > memory. If the storage memory usage is less than maxMemory, then there's a > possibility that no storage blocks will be evicted. This is caused by how > {{MemoryStore.ensureFreeSpace()}} is called inside of > {{StorageMemoryPool.shrinkPoolToReclaimSpace()}}. > Here's a failing regression test which demonstrates this bug: > https://github.com/apache/spark/commit/b519fe628a9a2b8238dfedbfd9b74bdd2ddc0de4?diff=unified#diff-b3a7cd2e011e048908d70f743c0ed7cfR155
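The sizes in this description can be checked against the figures in the SPARK-12155 executor log quoted earlier in this digest. The sketch below is arithmetic only, not Spark code: the function names are hypothetical, the storage fraction of 0.5 is an assumption (it is what makes the logged `storageRegionSize` come out as half of `maxMemory`), and the `max(...)` form of the reclaimable amount is likewise an assumption, of which only the difference term is confirmed by the logged numbers.

```python
def storage_region_size(max_memory, storage_fraction):
    """Size of the unevictable storage region: storageFraction * maxMemory."""
    return int(max_memory * storage_fraction)


def reclaimable_from_storage(pool_size, memory_free, region_size):
    """Storage memory that execution may claim (assumed formula, see lead-in)."""
    return max(memory_free, pool_size - region_size)


# Figures taken from the SPARK-12155 log; memory_free=0 is an assumption,
# since the log does not print the pool's free memory.
MAX_MEMORY = 16929521664
REGION = storage_region_size(MAX_MEMORY, 0.5)
FIRST = reclaimable_from_storage(16929521664, 0, REGION)
SECOND = reclaimable_from_storage(16929259520, 0, REGION)
```

Here `REGION` evaluates to 8464760832, `FIRST` to 8464760832 and `SECOND` to 8464498688, matching the `storageRegionSize` and the two `memoryReclaimableFromStorage` values in the log. The second pool size is the first one minus the 262144 bytes that had just been handed to the execution pool.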
[jira] [Created] (SPARK-12168) Need test for masked function
Felix Cheung created SPARK-12168: Summary: Need test for masked function Key: SPARK-12168 URL: https://issues.apache.org/jira/browse/SPARK-12168 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.2 Reporter: Felix Cheung Priority: Minor
[jira] [Created] (SPARK-12172) Remove SparkR internal RDD APIs
Felix Cheung created SPARK-12172: Summary: Remove SparkR internal RDD APIs Key: SPARK-12172 URL: https://issues.apache.org/jira/browse/SPARK-12172 Project: Spark Issue Type: Sub-task Reporter: Felix Cheung
[jira] [Created] (SPARK-12173) Consider supporting DataSet API in SparkR
Felix Cheung created SPARK-12173: Summary: Consider supporting DataSet API in SparkR Key: SPARK-12173 URL: https://issues.apache.org/jira/browse/SPARK-12173 Project: Spark Issue Type: Sub-task Reporter: Felix Cheung
[jira] [Updated] (SPARK-12171) Support DataSet API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui updated SPARK-12171: Component/s: Spark Submit > Support DataSet API in SparkR > - > > Key: SPARK-12171 > URL: https://issues.apache.org/jira/browse/SPARK-12171 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Sun Rui
[jira] [Created] (SPARK-12171) Support DataSet API in SparkR
Sun Rui created SPARK-12171: --- Summary: Support DataSet API in SparkR Key: SPARK-12171 URL: https://issues.apache.org/jira/browse/SPARK-12171 Project: Spark Issue Type: New Feature Reporter: Sun Rui
[jira] [Updated] (SPARK-12171) Support DataSet API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui updated SPARK-12171: Component/s: (was: Spark Submit) SparkR > Support DataSet API in SparkR > - > > Key: SPARK-12171 > URL: https://issues.apache.org/jira/browse/SPARK-12171 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Sun Rui
[jira] [Updated] (SPARK-12169) SparkR 2.0
[ https://issues.apache.org/jira/browse/SPARK-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui updated SPARK-12169: Issue Type: Improvement (was: Bug) > SparkR 2.0 > -- > > Key: SPARK-12169 > URL: https://issues.apache.org/jira/browse/SPARK-12169 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Sun Rui > > This is an umbrella issue addressing all SparkR related issues corresponding > to Spark 2.0 being planned.
[jira] [Commented] (SPARK-12172) Remove SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044393#comment-15044393 ] Sun Rui commented on SPARK-12172: - Not sure if we want to remove RDD API. Need more discussion. > Remove SparkR internal RDD APIs > --- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Felix Cheung
[jira] [Updated] (SPARK-12166) Unset hadoop related environment in testing
[ https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-12166: --- Priority: Minor (was: Major) > Unset hadoop related environment in testing > > > Key: SPARK-12166 > URL: https://issues.apache.org/jira/browse/SPARK-12166 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.5.2 >Reporter: Jeff Zhang >Priority: Minor > > I tried to run HiveSparkSubmitSuite on my local box, but it fails. The cause > is that Spark still uses my local single-node Hadoop cluster when running > the unit test. I don't think it makes sense to do that. The Hadoop-related environment > variables should be unset before testing. I suspect dev/run-tests > doesn't unset them either. > Here's the error message: > {code} > Cause: java.lang.RuntimeException: java.lang.RuntimeException: The root > scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: > rwxr-xr-x > [info] at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > [info] at > org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171) > [info] at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) > [info] at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > {code}
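The idea in SPARK-12166 of unsetting Hadoop-related variables before tests run can be sketched as follows. This is only an illustration, not the change proposed in the issue: the report does not name the variables, so the list `HADOOP_HOME`, `HADOOP_CONF_DIR`, `YARN_CONF_DIR` is an assumption, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Sketch: start a child process with Hadoop-related variables removed from its environment. */
class CleanHadoopEnv {
    // Assumed list of variables; the issue report does not say which ones matter.
    static final List<String> HADOOP_VARS =
            Arrays.asList("HADOOP_HOME", "HADOOP_CONF_DIR", "YARN_CONF_DIR");

    /** Removes the assumed Hadoop variables from the given environment map, in place. */
    static Map<String, String> scrub(Map<String, String> env) {
        for (String name : HADOOP_VARS) {
            env.remove(name);
        }
        return env;
    }

    /** Builds a ProcessBuilder whose child environment has been scrubbed. */
    static ProcessBuilder cleanBuilder(String... command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        // environment() starts as a copy of the parent's environment and is mutable.
        scrub(pb.environment());
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = cleanBuilder("java", "-version");
        System.out.println("HADOOP_CONF_DIR still set: "
                + pb.environment().containsKey("HADOOP_CONF_DIR"));
    }
}
```

Only the child process is affected; the parent JVM's own environment is left untouched, which is why a test launcher would need to apply this wherever it forks.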
[jira] [Assigned] (SPARK-12167) Invoke the right sameResult function when plan is warpped with SubQueries
[ https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12167: Assignee: Apache Spark > Invoke the right sameResult function when plan is warpped with SubQueries > - > > Key: SPARK-12167 > URL: https://issues.apache.org/jira/browse/SPARK-12167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yadong Qi >Assignee: Apache Spark > > I found this bug when using a cached table: > ``` > spark-sql> create table src_p(key int, value int) stored as parquet; > OK > Time taken: 3.144 seconds > spark-sql> cache table src_p; > Time taken: 1.452 seconds > spark-sql> explain extended select count(*) from src_p; > ``` > I got the wrong physical plan: > ``` > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[_c0#28L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#33L]) >Scan ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][] > ``` > whereas the right physical plan is: > ``` > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[_c0#47L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#62L]) >InMemoryColumnarTableScan (InMemoryRelation [key#45,value#46], true, > 1, StorageLevel(true, true, false, true, 1), (Scan > ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p][key#9,value#10]), > Some(src_p)) > ``` > When implementations of `MultiInstanceRelation` (e.g. > `LogicalRelation`, `LocalRelation`) are wrapped in SubQueries, their own > `sameResult` implementation is not invoked. So we > need to eliminate the SubQueries first and then invoke the `sameResult` > function of the wrapped class. 
> For example: > when the plan is > `Subquery(LogicalRelation(relation:ParquetRelation[hdfs://9.91.8.131:9000/user/hive/warehouse/src_p], > expectedOutputAttributes:Some(ArrayBuffer(key#0, value#1`, first > eliminate the SubQueries; the `sameResult` function of > `LogicalRelation` is then invoked instead of the one in `LogicalPlan`.
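The "eliminate SubQueries first, then call `sameResult`" step from SPARK-12167 can be pictured with a toy model. The classes below are hypothetical stand-ins and are not Spark's Catalyst classes: the default `sameResult` here is plain reference equality, chosen only to play the role of a generic base-class check, and only top-level wrappers are stripped.

```java
/** Toy model (not Catalyst): wrappers hide a relation's own sameResult until they are stripped. */
class SubqueryStripDemo {
    interface Plan {
        /** Generic fallback comparison, standing in for a base-class implementation. */
        default boolean sameResult(Plan other) {
            return this == other;
        }
    }

    /** Alias wrapper around a child plan; it inherits the generic comparison. */
    static final class Subquery implements Plan {
        final String alias;
        final Plan child;

        Subquery(String alias, Plan child) {
            this.alias = alias;
            this.child = child;
        }
    }

    /** Leaf relation with its own, more specific comparison. */
    static final class Relation implements Plan {
        final String path;

        Relation(String path) {
            this.path = path;
        }

        @Override
        public boolean sameResult(Plan other) {
            return other instanceof Relation && path.equals(((Relation) other).path);
        }
    }

    /** Peels off any number of top-level Subquery wrappers. */
    static Plan stripSubqueries(Plan plan) {
        while (plan instanceof Subquery) {
            plan = ((Subquery) plan).child;
        }
        return plan;
    }

    /** Strips wrappers on both sides, then dispatches to the unwrapped plan's own sameResult. */
    static boolean sameResultIgnoringSubqueries(Plan a, Plan b) {
        return stripSubqueries(a).sameResult(stripSubqueries(b));
    }

    public static void main(String[] args) {
        Plan cached = new Relation("/user/hive/warehouse/src_p");
        Plan query = new Subquery("src_p", new Relation("/user/hive/warehouse/src_p"));
        System.out.println(query.sameResult(cached));
        System.out.println(sameResultIgnoringSubqueries(query, cached));
    }
}
```

In this toy, the wrapped plan fails to match the cached relation until the wrapper is removed, which mirrors the cache miss shown in the "wrong" physical plan above.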
[jira] [Commented] (SPARK-12167) Invoke the right sameResult function when plan is warpped with SubQueries
[ https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044335#comment-15044335 ] Apache Spark commented on SPARK-12167: -- User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/10169 > Invoke the right sameResult function when plan is warpped with SubQueries > - > > Key: SPARK-12167 > URL: https://issues.apache.org/jira/browse/SPARK-12167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yadong Qi
[jira] [Assigned] (SPARK-12167) Invoke the right sameResult function when plan is warpped with SubQueries
[ https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12167: Assignee: (was: Apache Spark) > Invoke the right sameResult function when plan is warpped with SubQueries > - > > Key: SPARK-12167 > URL: https://issues.apache.org/jira/browse/SPARK-12167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yadong Qi
[jira] [Commented] (SPARK-12169) SparkR 2.0
[ https://issues.apache.org/jira/browse/SPARK-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044390#comment-15044390 ] Felix Cheung commented on SPARK-12169: -- Great, thanks for opening this. I think we should definitely consider removing the RDD APIs; that would help us get to a smaller code base. > SparkR 2.0 > -- > > Key: SPARK-12169 > URL: https://issues.apache.org/jira/browse/SPARK-12169 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Sun Rui > > This is an umbrella issue addressing all SparkR related issues corresponding > to Spark 2.0 being planned.
[jira] [Commented] (SPARK-12167) Invoke the right sameResult function when plan is warpped with SubQueries
[ https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044488#comment-15044488 ] Yadong Qi commented on SPARK-12167: --- SPARK-11246 has already fixed it > Invoke the right sameResult function when plan is warpped with SubQueries > - > > Key: SPARK-12167 > URL: https://issues.apache.org/jira/browse/SPARK-12167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yadong Qi
[jira] [Commented] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.
[ https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044348#comment-15044348 ] Josh Rosen commented on SPARK-12155: I think this is blocked by the fix for SPARK-12165, a closely related bug which affects the eviction of storage memory in the single-concurrent-task case. > Execution OOM after a relative large dataset cached in the cluster. > --- > > Key: SPARK-12155 > URL: https://issues.apache.org/jira/browse/SPARK-12155 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > > I have a cluster with roughly 80GB of memory. I cached a 43GB DataFrame. > When I started to run the query, I got the following exception (I added > more logging to the code). > {code} > 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for > 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize. > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for > block rdd_94_37(free: 3253659951, max: 16798973952) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for > block rdd_94_37(free: 3252611375, max: 16798973952) > 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for > block rdd_94_37(free: 3314840375, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for > block rdd_94_37(free: 3215892137, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space > for block rdd_94_37(free: 3117216424, max: 16866344960) > 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space > for block rdd_94_37(free: 2919868859, max: 16866344960) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space > for block rdd_94_37(free: 2687050010, max: 16929521664) > 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). > 3028 bytes result sent to driver > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space > for block rdd_94_37(free: 2292321531, max: 16929521664) > 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space > for block rdd_94_37(free: 1701062715, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space > for block rdd_94_37(free: 799417533, max: 16929521664) > 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would > require dropping another block from the same RDD > 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in > memory! (computed 2.4 GB so far) > 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB > (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB. > 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. 
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. > 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory > from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes > memory. But, on-heap execution memory poll only has 0 bytes free memory. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage > 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize > 8464760832. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from > storage memory pool. > 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory > space from StorageMemoryPool. > 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of > memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool. > 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). > 3077 bytes result sent to driver > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120 > 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120) > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124 > 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend:
[jira] [Commented] (SPARK-12045) Use joda's DateTime to replace Calendar
[ https://issues.apache.org/jira/browse/SPARK-12045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044360#comment-15044360 ] Wenchen Fan commented on SPARK-12045: - 1. We need to operate on UTF8String (an internal string representation in Spark SQL), which is why we do the parsing by hand (see DateTimeUtils.stringToDate, DateTimeUtils.stringToTimestamp). Yes, you could convert the UTF8String to a String first and call a third-party library like joda, but that would be inefficient. 2. We follow Hive in returning null for an invalidly formatted string, but I agree that throwing an exception by default seems more reasonable. cc [~marmbrus] > Use joda's DateTime to replace Calendar > --- > > Key: SPARK-12045 > URL: https://issues.apache.org/jira/browse/SPARK-12045 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Jeff Zhang > > Currently Spark uses Calendar to build the Date when converting from a string to a > Date. But Calendar does not detect invalid dates (e.g. 2011-02-29) by default. > We can call Calendar.setLenient(false) to make Calendar detect > them, but I found the resulting error message very confusing. So I > suggest using joda's DateTime to replace Calendar. > Besides that, I found that there is already some format-checking logic when > casting a string to a date, and if the format is invalid it returns None. I > don't think it makes sense to just return None without telling users. I think > the default should be to throw an exception, and users could set a property to > get None for an invalid format instead. > {code} > if (i == 0 && j != 4) { > // year should have exact four digits > return None > } > {code}
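The Calendar behaviour discussed in SPARK-12045 can be reproduced with the JDK alone. The sketch below is illustrative and is not Spark's `DateTimeUtils` code; the class and method names are hypothetical. It shows a lenient `GregorianCalendar` rolling 2011-02-29 forward, and a non-lenient one rejecting it when the time is computed.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

/** Sketch: lenient vs. non-lenient handling of an impossible date such as 2011-02-29. */
class LenientCalendarDemo {
    /** Returns true if year-month-day is a real date, checked with a non-lenient Calendar. */
    static boolean isValidDate(int year, int month, int day) {
        Calendar cal = new GregorianCalendar();
        cal.clear();
        cal.setLenient(false);
        cal.set(year, month - 1, day); // Calendar months are zero-based
        try {
            cal.getTime(); // fields are validated lazily, when the time is computed
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** Default (lenient) behaviour: out-of-range fields roll over into the next unit. */
    static int[] lenientResolve(int year, int month, int day) {
        Calendar cal = new GregorianCalendar();
        cal.clear();
        cal.set(year, month - 1, day);
        return new int[] {
            cal.get(Calendar.YEAR),
            cal.get(Calendar.MONTH) + 1,
            cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    public static void main(String[] args) {
        int[] rolled = lenientResolve(2011, 2, 29);
        System.out.println(rolled[0] + "-" + rolled[1] + "-" + rolled[2]);
        System.out.println(isValidDate(2011, 2, 29));
        System.out.println(isValidDate(2012, 2, 29));
    }
}
```

The non-lenient path signals the problem only through an `IllegalArgumentException`, so a caller that wants a clearer error would have to catch it and report the offending input itself.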
[jira] [Updated] (SPARK-12168) Need test for conflicted function in R
[ https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-12168: - Summary: Need test for conflicted function in R (was: Need test for masked function) > Need test for conflicted function in R > -- > > Key: SPARK-12168 > URL: https://issues.apache.org/jira/browse/SPARK-12168 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Priority: Minor > > Currently it is hard to know whether a function in the base or stats packages is > masked when a new function is added to SparkR. > Having an automated test would make it easier to track such changes.
[jira] [Updated] (SPARK-12168) Need test for masked function
[ https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-12168: - Description: Currently it is hard to know whether a function in the base or stats packages is masked when a new function is added to SparkR. Having an automated test would make it easier to track such changes. > Need test for masked function > - > > Key: SPARK-12168 > URL: https://issues.apache.org/jira/browse/SPARK-12168 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Priority: Minor > > Currently it is hard to know whether a function in the base or stats packages is > masked when a new function is added to SparkR. > Having an automated test would make it easier to track such changes.
[jira] [Created] (SPARK-12170) Deprecate the JAVA-specific deserialized storage levels
Sun Rui created SPARK-12170: --- Summary: Deprecate the JAVA-specific deserialized storage levels Key: SPARK-12170 URL: https://issues.apache.org/jira/browse/SPARK-12170 Project: Spark Issue Type: Sub-task Reporter: Sun Rui This is to be consistent with SPARK-12091, which covers PySpark.
[jira] [Commented] (SPARK-12169) SparkR 2.0
[ https://issues.apache.org/jira/browse/SPARK-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044394#comment-15044394 ] Felix Cheung commented on SPARK-12169: -- For those reading: we shouldn't open PRs yet. See SPARK-11806. > SparkR 2.0 > -- > > Key: SPARK-12169 > URL: https://issues.apache.org/jira/browse/SPARK-12169 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Sun Rui > > This is an umbrella issue addressing all SparkR related issues corresponding > to Spark 2.0 being planned.
[jira] [Assigned] (SPARK-12165) Execution memory requests may fail to evict storage blocks if storage memory usage is below max memory
[ https://issues.apache.org/jira/browse/SPARK-12165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12165: Assignee: Josh Rosen (was: Apache Spark) > Execution memory requests may fail to evict storage blocks if storage memory > usage is below max memory > -- > > Key: SPARK-12165 > URL: https://issues.apache.org/jira/browse/SPARK-12165 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > Consider a scenario where storage memory usage has grown past the size of the > unevictable storage region ({{spark.memory.storageFraction}} * maxMemory) and > a task needs to acquire more execution memory by reclaiming evictable storage > memory. If the storage memory usage is less than maxMemory, then there's a > possibility that no storage blocks will be evicted. This is caused by how > {{MemoryStore.ensureFreeSpace()}} is called inside of > {{StorageMemoryPool.shrinkPoolToReclaimSpace()}}. > Here's a failing regression test which demonstrates this bug: > https://github.com/apache/spark/commit/b519fe628a9a2b8238dfedbfd9b74bdd2ddc0de4?diff=unified#diff-b3a7cd2e011e048908d70f743c0ed7cfR155
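The size of the unevictable storage region named in SPARK-12165 ({{spark.memory.storageFraction}} * maxMemory) is easy to work through with numbers. The sketch below is plain arithmetic, not Spark's `UnifiedMemoryManager` code. The `maxMemory` figure is taken from the SPARK-12155 log in this digest; the fraction 0.5 is an assumption chosen because it reproduces the `storageRegionSize` that log reports, and the storage usage figure is hypothetical.

```java
/** Sketch: arithmetic for the protected storage region and the part of storage above it. */
class StorageRegionMath {
    /** Protected (unevictable) storage region: storageFraction * maxMemory. */
    static long storageRegionSize(long maxMemory, double storageFraction) {
        return (long) (maxMemory * storageFraction);
    }

    /** Storage memory above the protected region, i.e. what execution could ask to reclaim. */
    static long evictableStorage(long storageUsed, long storageRegionSize) {
        return Math.max(0L, storageUsed - storageRegionSize);
    }

    public static void main(String[] args) {
        long maxMemory = 16929521664L;       // from the SPARK-12155 log
        double storageFraction = 0.5;        // assumed; matches the logged storageRegionSize
        long storageUsed = 12_000_000_000L;  // hypothetical usage between the region size and maxMemory

        long region = storageRegionSize(maxMemory, storageFraction);
        System.out.println("storageRegionSize = " + region);
        System.out.println("evictable storage = " + evictableStorage(storageUsed, region));
    }
}
```

With these inputs storage usage sits above the protected region but below maxMemory, which is exactly the range in which the issue says eviction may fail to happen.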
[jira] [Commented] (SPARK-12165) Execution memory requests may fail to evict storage blocks if storage memory usage is below max memory
[ https://issues.apache.org/jira/browse/SPARK-12165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044342#comment-15044342 ] Apache Spark commented on SPARK-12165: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10170 > Execution memory requests may fail to evict storage blocks if storage memory > usage is below max memory > -- > > Key: SPARK-12165 > URL: https://issues.apache.org/jira/browse/SPARK-12165 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker
[jira] [Assigned] (SPARK-12168) Need test for conflicted function in R
[ https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12168: Assignee: (was: Apache Spark) > Need test for conflicted function in R > -- > > Key: SPARK-12168 > URL: https://issues.apache.org/jira/browse/SPARK-12168 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Priority: Minor > > Currently it is hard to know whether a function in the base or stats packages is > masked when a new function is added to SparkR. > Having an automated test would make it easier to track such changes.
[jira] [Closed] (SPARK-12171) Support DataSet API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui closed SPARK-12171. --- Resolution: Duplicate > Support DataSet API in SparkR > - > > Key: SPARK-12171 > URL: https://issues.apache.org/jira/browse/SPARK-12171 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Sun Rui
[jira] [Updated] (SPARK-6990) Add Java linting script
[ https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6990: -- Assignee: Dmitry Erastov > Add Java linting script > --- > > Key: SPARK-6990 > URL: https://issues.apache.org/jira/browse/SPARK-6990 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Reporter: Josh Rosen >Assignee: Dmitry Erastov >Priority: Minor > Labels: starter > Fix For: 2.0.0 > > > It would be nice to add a {{dev/lint-java}} script to enforce style rules for > Spark's Java code.
[jira] [Closed] (SPARK-12167) Invoke the right sameResult function when plan is warpped with SubQueries
[ https://issues.apache.org/jira/browse/SPARK-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yadong Qi closed SPARK-12167. - Resolution: Duplicate > Invoke the right sameResult function when plan is warpped with SubQueries > - > > Key: SPARK-12167 > URL: https://issues.apache.org/jira/browse/SPARK-12167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yadong Qi
[jira] [Commented] (SPARK-12166) Unset hadoop related environment in testing
[ https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044531#comment-15044531 ] Apache Spark commented on SPARK-12166: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/10172 > Unset hadoop related environment in testing > > > Key: SPARK-12166 > URL: https://issues.apache.org/jira/browse/SPARK-12166 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 1.5.2 >Reporter: Jeff Zhang >Priority: Minor
[jira] [Assigned] (SPARK-12166) Unset hadoop related environment in testing
[ https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12166:

Assignee: Apache Spark
[jira] [Assigned] (SPARK-12166) Unset hadoop related environment in testing
[ https://issues.apache.org/jira/browse/SPARK-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12166:

Assignee: (was: Apache Spark)