Structured Streaming partition logic with respect to storage and fileformat
When we use readStream to read data as a stream, how does Spark decide the number of RDDs, and the partitions within each RDD, with respect to the storage and file format?

val dsJson = sqlContext.readStream.json("/Users/sachin/testSpark/inputJson")

val dsCsv = sqlContext.readStream.option("header","true").csv("/Users/sachin/testSpark/inputCsv")

val ds = sqlContext.readStream.text("/Users/sachin/testSpark/inputText")
val dsText = ds.as[String].map(x => (x.split(" ")(0), x.split(" ")(1))).toDF("name","age")

val dsParquet = sqlContext.readStream.format("parquet").parquet("/Users/sachin/testSpark/inputParquet")

--
Thanks & Regards
Sachin Aggarwal
7760502772
Possible contribution to MLlib
Hi all,

In my team, we are currently developing a fork of Spark MLlib extending the K-means method so that it is possible to set one's own distance function. In this implementation, it would be possible to pass, directly as an argument of the K-means train function, a distance function whose signature is: (VectorWithNorm, VectorWithNorm) => Double.

We have found the JIRA issue SPARK-11665 proposing to support new distances in bisecting K-means. There has also been the JIRA issue SPARK-3219 proposing to add Bregman divergences as distance functions, but it has not been added to MLlib. Therefore, we are wondering whether such an extension of the MLlib K-means algorithm would be appreciated by the community and would have a chance of being included in future Spark releases.

Regards,

Simon Nanty
Re: [VOTE] Release Apache Spark 1.6.2 (RC2)
The PR (https://github.com/apache/spark/pull/13055) to fix https://issues.apache.org/jira/browse/SPARK-15262 was applied to 1.6.2; however, this fix caused another issue, https://issues.apache.org/jira/browse/SPARK-15606, the fix for which (https://github.com/apache/spark/pull/13355) has not been backported to the 1.6 branch, so I'm now seeing the same failure in 1.6.2.

Cheers,

On Mon, 20 Jun 2016 at 05:25 Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version 1.6.2. The vote is open until Wednesday, June 22, 2016 at 22:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.2
> [ ] -1 Do not release this package because ...
>
> The tag to be voted on is v1.6.2-rc2 (54b1121f351f056d6b67d2bb4efe0d553c0f7482)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1186/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-docs/
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions from 1.6.1.
>
> == What justifies a -1 vote for this release? ==
> This is a maintenance release in the 1.6.x series. Bugs already present in 1.6.1, missing features, or bugs related to new features will not necessarily block this release.
Re: Structured Streaming partition logic with respect to storage and fileformat
It's based on the underlying Hadoop FileFormat, which mostly creates splits based on block size. You can change this, though.

> On 21 Jun 2016, at 12:19, Sachin Aggarwal wrote:
>
> When we use readStream to read data as a stream, how does Spark decide the number of RDDs, and the partitions within each RDD, with respect to the storage and file format?
> [...]
>
> --
> Thanks & Regards
> Sachin Aggarwal
> 7760502772
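As a rough sketch of the block-size-driven split logic mentioned above (the real Hadoop FileInputFormat also honours configured min/max split sizes and allows the last split some slack, so this is a simplified model, not the actual implementation):

```scala
// Simplified model of FileInputFormat's split count: roughly one split
// (and hence one partition) per blockSize bytes of input, with a floor
// of one split per non-empty file.
def numSplits(fileSizeBytes: Long, blockSizeBytes: Long): Long =
  math.max(1L, math.ceil(fileSizeBytes.toDouble / blockSizeBytes).toLong)

// A 1 GiB file with the default 128 MiB HDFS block size yields 8 splits.
println(numSplits(1024L * 1024 * 1024, 128L * 1024 * 1024)) // 8
// A tiny file still gets one split of its own.
println(numSplits(100L, 128L * 1024 * 1024)) // 1
```

Under this model, many small files produce many small partitions, which is one reason file layout matters as much as format for streaming sources.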
Re: Question about Bloom Filter in Spark 2.0
SPARK-12818 is about building a Bloom filter on existing data. It has nothing to do with the ORC Bloom filter, which can be used to do predicate pushdown.

On Tue, Jun 21, 2016 at 7:45 PM, BaiRan wrote:
> Hi all,
>
> I have a question about the Bloom filter implementation in the SPARK-12818 issue. If I have an ORC file with Bloom filter metadata, how can I utilise it from Spark SQL?
>
> Thanks.
>
> Best,
> Ran
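For readers unfamiliar with the distinction: the SPARK-12818 filter is a probabilistic set you build yourself over column values, conceptually like the toy sketch below. This is an illustrative simplification only, not Spark's actual implementation, and the class and parameter names here are made up for the example:

```scala
import scala.util.hashing.MurmurHash3

// Toy Bloom filter: k hash positions per item over a fixed bit array.
// mightContain returning false means "definitely absent"; returning
// true means only "possibly present" (false positives are possible).
class SimpleBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new Array[Boolean](numBits)

  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      math.abs(MurmurHash3.stringHash(item, seed)) % numBits
    }

  def put(item: String): Unit = positions(item).foreach(bits(_) = true)

  def mightContain(item: String): Boolean = positions(item).forall(bits(_))
}

val bf = new SimpleBloomFilter(numBits = 1024, numHashes = 3)
bf.put("alice")
bf.put("bob")
println(bf.mightContain("alice")) // true: inserted items always match
println(bf.mightContain("carol")) // almost certainly false at this fill level
```

The ORC Bloom filter the reply mentions is a different thing: metadata written into the file so a reader can skip stripes during predicate pushdown, without the query author building anything.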
Re: [VOTE] Release Apache Spark 2.0.0 (RC1)
I'd officially -1 this while there are still many blockers, but it should certainly be tested as usual, because they're mostly doc and "audit" type issues.

On Wed, Jun 22, 2016 at 2:26 AM, Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, June 24, 2016 at 19:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast.
> [...]
[VOTE] Release Apache Spark 2.0.0 (RC1)
Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, June 24, 2016 at 19:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.0
[ ] -1 Do not release this package because ...

The tag to be voted on is v2.0.0-rc1 (0c66ca41afade6db73c9aeddd5aed6e5dcea90df).

This release candidate resolves ~2400 issues: https://s.apache.org/spark-2.0.0-rc1-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1187/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions from 1.x.

== What justifies a -1 vote for this release? ==
Critical bugs impacting major functionalities. Bugs already present in 1.x, missing features, or bugs related to new features will not necessarily block this release. Note that historically Spark documentation has been published on the website separately from the main release, so we do not need to block the release due to documentation errors either.
Question about Bloom Filter in Spark 2.0
Hi all,

I have a question about the Bloom filter implementation in the SPARK-12818 issue. If I have an ORC file with Bloom filter metadata, how can I utilise it from Spark SQL?

Thanks.

Best,
Ran
Re: Structured Streaming partition logic with respect to storage and fileformat
What will the scenario be in the case of S3 and the local file system?

On Tue, Jun 21, 2016 at 4:36 PM, Jörn Franke wrote:
> It's based on the underlying Hadoop FileFormat, which mostly creates splits based on block size. You can change this, though.
>
> [...]

--
Thanks & Regards
Sachin Aggarwal
7760502772
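One relevant point for the S3/local question: neither store has real HDFS blocks, so the configured split settings drive partitioning there. A hedged sketch of how the split size can typically be tuned; the option names below are the standard Hadoop ones, but exact behaviour depends on your Hadoop version and file-system connector, so treat this as a pointer rather than a verified recipe (it also assumes a live sqlContext):

```scala
// Hedged sketch: smaller max split sizes generally mean more, smaller
// partitions, whether the data lives on HDFS, S3 or the local file
// system. These settings are read by the Hadoop input layer at scan time.
val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize",
  (64L * 1024 * 1024).toString) // cap splits at 64 MB
// For s3a, the "block size" reported to the splitter is itself a config.
hadoopConf.set("fs.s3a.block.size", (64L * 1024 * 1024).toString)
```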
Re: Possible contribution to MLlib
I think it is valuable to make the distance function pluggable and also provide some built-in distance functions. This might also be useful for other algorithms besides K-means.

On Tue, Jun 21, 2016 at 7:48 PM, Simon NANTY wrote:
> Hi all,
>
> In my team, we are currently developing a fork of Spark MLlib extending the K-means method so that it is possible to set one's own distance function. In this implementation, it would be possible to pass, directly as an argument of the K-means train function, a distance function whose signature is: (VectorWithNorm, VectorWithNorm) => Double.
>
> We have found the JIRA issue SPARK-11665 proposing to support new distances in bisecting K-means. There has also been the JIRA issue SPARK-3219 proposing to add Bregman divergences as distance functions, but it has not been added to MLlib. Therefore, we are wondering whether such an extension of the MLlib K-means algorithm would be appreciated by the community and would have a chance of being included in future Spark releases.
>
> Regards,
>
> Simon Nanty

--
Best Regards
Jeff Zhang
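To make the proposal concrete, here is a minimal, self-contained sketch of what pluggable distance functions of the proposed shape could look like. Plain Array[Double] stands in for MLlib's private VectorWithNorm, and the names are illustrative only, not part of any actual MLlib API:

```scala
// A distance function takes two vectors and returns a non-negative score.
type DistanceFn = (Array[Double], Array[Double]) => Double

// Built-in candidate 1: Euclidean distance (what K-means uses today).
val euclidean: DistanceFn = (a, b) =>
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Built-in candidate 2: cosine distance, often wanted for text clustering.
val cosineDistance: DistanceFn = (a, b) => {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val na  = math.sqrt(a.map(x => x * x).sum)
  val nb  = math.sqrt(b.map(x => x * x).sum)
  1.0 - dot / (na * nb)
}

println(euclidean(Array(0.0, 0.0), Array(3.0, 4.0))) // 5.0
println(cosineDistance(Array(1.0, 0.0), Array(0.0, 1.0))) // 1.0 (orthogonal)
```

A train method accepting such a parameter (with euclidean as the default) would keep existing callers unchanged while opening the door to Bregman divergences and the like.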
Jar for Spark developement
Hi,

I'm a beginner in Spark development. It took time to configure Eclipse + Scala. Is there any tutorial that can help beginners? I'm still struggling to find the Spark JAR files for development. There is no lib folder in my Spark distribution (neither in the pre-built one nor in a custom build).

Regards,
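One common answer to this kind of setup question: Spark artifacts are published to Maven Central, so a build tool can declare them as dependencies instead of hunting for jars inside the distribution. A minimal build.sbt sketch; the versions here are illustrative and should be matched to your cluster:

```scala
// build.sbt -- minimal sketch for compiling against Spark.
// "provided" keeps Spark out of your application jar, since the
// cluster supplies it at runtime.
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.2" % "provided"
)
```

Eclipse project files can then be generated from this build (e.g. via an sbt Eclipse plugin) rather than wired up by hand.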
Re: [VOTE] Release Apache Spark 1.6.2 (RC2)
Nice one; yeah, indeed I was doing an incremental build. Not a blocker. I'll have a look into the others, though I suspect they're problems with tests rather than production code.

On Tue, Jun 21, 2016 at 6:53 PM, Marcelo Vanzin wrote:
> On Tue, Jun 21, 2016 at 10:49 AM, Sean Owen wrote:
>> I'm getting some errors building on Ubuntu 16 + Java 7. The first is one that may just be down to a Scala bug:
>>
>> [ERROR] bad symbolic reference. A signature in WebUI.class refers to term eclipse
>> in package org which is not available.
>
> This is probably https://issues.apache.org/jira/browse/SPARK-13780. It should only affect incremental builds ("mvn -rf ..." or "mvn -pl ..."), not clean builds. Not sure about the other ones.
>
> --
> Marcelo
Re: [VOTE] Release Apache Spark 1.6.2 (RC2)
On Tue, Jun 21, 2016 at 10:49 AM, Sean Owen wrote:
> I'm getting some errors building on Ubuntu 16 + Java 7. The first is one that may just be down to a Scala bug:
>
> [ERROR] bad symbolic reference. A signature in WebUI.class refers to term eclipse
> in package org which is not available.

This is probably https://issues.apache.org/jira/browse/SPARK-13780. It should only affect incremental builds ("mvn -rf ..." or "mvn -pl ..."), not clean builds. Not sure about the other ones.

--
Marcelo
Re: [VOTE] Release Apache Spark 1.6.2 (RC2)
Hey Pete,

I didn't backport it to 1.6 because it just affects tests in most cases. I'm sure we also have other places calling blocking methods in the event loops, so similar issues are still there even after applying this patch. Hence, I don't think it's a blocker for 1.6.2.

On Tue, Jun 21, 2016 at 2:57 AM, Pete Robbins wrote:
> The PR (https://github.com/apache/spark/pull/13055) to fix https://issues.apache.org/jira/browse/SPARK-15262 was applied to 1.6.2; however, this fix caused another issue, https://issues.apache.org/jira/browse/SPARK-15606, the fix for which (https://github.com/apache/spark/pull/13355) has not been backported to the 1.6 branch, so I'm now seeing the same failure in 1.6.2.
>
> Cheers,
>
> On Mon, 20 Jun 2016 at 05:25 Reynold Xin wrote:
>> Please vote on releasing the following candidate as Apache Spark version 1.6.2. [...]
Re: [VOTE] Release Apache Spark 1.6.2 (RC2)
I'm getting some errors building on Ubuntu 16 + Java 7. The first is one that may just be down to a Scala bug:

[ERROR] bad symbolic reference. A signature in WebUI.class refers to term eclipse
in package org which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling WebUI.class.

[ERROR] bad symbolic reference. A signature in WebUI.class refers to term jetty
in value org.eclipse which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling WebUI.class.

But I'm seeing some consistent timezone-related failures, from core:

UIUtilsSuite:
- formatBatchTime *** FAILED ***
  "2015/05/14 [14]:04:40" did not equal "2015/05/14 [21]:04:40" (UIUtilsSuite.scala:73)

and several from Spark SQL, like:

- udf_unix_timestamp *** FAILED ***
  Results do not match for udf_unix_timestamp:
  == Parsed Logical Plan ==
  'Project [unresolvedalias(2009-03-20 11:30:01),unresolvedalias('unix_timestamp(2009-03-20 11:30:01))]
  +- 'UnresolvedRelation `oneline`, None
  == Analyzed Logical Plan ==
  _c0: string, _c1: bigint
  Project [2009-03-20 11:30:01 AS _c0#122914,unixtimestamp(2009-03-20 11:30:01,yyyy-MM-dd HH:mm:ss) AS _c1#122915L]
  +- MetastoreRelation default, oneline, None
  == Optimized Logical Plan ==
  Project [2009-03-20 11:30:01 AS _c0#122914,1237548601 AS _c1#122915L]
  +- MetastoreRelation default, oneline, None
  == Physical Plan ==
  Project [2009-03-20 11:30:01 AS _c0#122914,1237548601 AS _c1#122915L]
  +- HiveTableScan MetastoreRelation default, oneline, None
  _c0 _c1
  !== HIVE - 1 row(s) ==            == CATALYST - 1 row(s) ==
  !2009-03-20 11:30:01 1237573801   2009-03-20 11:30:01 1237548601 (HiveComparisonTest.scala:458)

I'll start looking into them. These could be real, if possibly minor, bugs, because I presume most of the testing happens on machines in a PDT timezone instead of UTC? That's at least the timezone of the machine I'm testing on.

On Mon, Jun 20, 2016 at 5:24 AM, Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version 1.6.2. [...]
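The 7-hour gap in the formatBatchTime failure is exactly the UTC/PDT offset, which supports the timezone theory. A small self-contained illustration (plain java.text formatting, not the actual UIUtils code): the same instant renders differently under the two zones, so a test with a hard-coded expected string passes only in one of them.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// Format one epoch instant under an explicit timezone.
def format(epochMs: Long, tz: String): String = {
  val fmt = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
  fmt.setTimeZone(TimeZone.getTimeZone(tz))
  fmt.format(new Date(epochMs))
}

val t = 1431637480000L // an instant close to the one in the failing test
println(format(t, "UTC"))        // 2015/05/14 21:04:40
println(format(t, "US/Pacific")) // 2015/05/14 14:04:40 (PDT = UTC-7)
```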
Re: [VOTE] Release Apache Spark 1.6.2 (RC2)
It breaks Spark running on machines with fewer than 3 cores/threads, which may be rare, and it is maybe an edge case.

Personally, I like to fix known bugs, and the fact that there are other blocking methods in event loops actually makes it worse not to fix the ones you know about.

Probably not a blocker to release, though, but that's your call.

Cheers,

On Tue, Jun 21, 2016 at 6:40 PM Shixiong(Ryan) Zhu wrote:
> Hey Pete,
>
> I didn't backport it to 1.6 because it just affects tests in most cases. I'm sure we also have other places calling blocking methods in the event loops, so similar issues are still there even after applying this patch. Hence, I don't think it's a blocker for 1.6.2.
>
> On Tue, Jun 21, 2016 at 2:57 AM, Pete Robbins wrote:
>> The PR (https://github.com/apache/spark/pull/13055) to fix https://issues.apache.org/jira/browse/SPARK-15262 was applied to 1.6.2; however, this fix caused another issue, https://issues.apache.org/jira/browse/SPARK-15606, the fix for which (https://github.com/apache/spark/pull/13355) has not been backported to the 1.6 branch, so I'm now seeing the same failure in 1.6.2.
>>
>> Cheers,
>>
>> [...]
Re: [VOTE] Release Apache Spark 1.6.2 (RC2)
Hey Pete,

I just pushed your PR to branch 1.6. As it's not a blocker, it may or may not be in 1.6.2, depending on whether there will be another RC.

On Tue, Jun 21, 2016 at 1:36 PM, Pete Robbins wrote:
> It breaks Spark running on machines with fewer than 3 cores/threads, which may be rare, and it is maybe an edge case.
>
> Personally, I like to fix known bugs, and the fact that there are other blocking methods in event loops actually makes it worse not to fix the ones you know about.
>
> Probably not a blocker to release, though, but that's your call.
>
> Cheers,
>
> [...]