Re: Machine learning question (using Spark) - removing redundant factors while doing clustering
There must be an algorithmic way to figure out which of these factors contribute the least and remove them from the analysis. I am hoping someone can throw some insight on this.

On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
> Not an expert here, but the first step would be to devote some time and
> identify which of these 112 factors are actually causative. Some domain
> knowledge of the data may be required. Then, you can start off with PCA.
>
> HTH,
>
> Regards,
>
> Sivakumaran S
>
> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane@gmail.com> wrote:
>
> Great question Rohit. I am in my early days of ML as well and it would be
> great if we get some idea on this from other experts on this group.
>
> I know we can reduce dimensions by using PCA, but I think that does not
> allow us to understand which factors from the original we are using in the
> end.
>
> - Tony L.
>
> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha <rohitchaddha1...@gmail.com> wrote:
>
>> I have a data-set where each data-point has 112 factors.
>>
>> I want to remove the factors which are not relevant, and say reduce to 20
>> factors out of these 112, and then do clustering of data-points using these
>> 20 factors.
>>
>> How do I do this, and how do I figure out which of the 20 factors are
>> useful for analysis?
>>
>> I see SVD and PCA implementations, but I am not sure if these give which
>> elements are removed and which are remaining.
>>
>> Can someone please help me understand what to do here.
>>
>> thanks,
>> -Rohit
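A minimal Java sketch of the PCA route suggested above, assuming data is a Dataset<Row> with a Vector column named "features" holding the 112 raw factors (the variable and column names are illustrative, not from the thread). Note that PCA produces 20 derived components rather than selecting 20 of the original factors; inspecting the component loadings is one way to see which original factors carry the most weight:

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.ml.feature.PCA;
    import org.apache.spark.ml.feature.PCAModel;
    import org.apache.spark.ml.linalg.DenseMatrix;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Reduce the 112-dimensional "features" column to 20 derived components.
    PCAModel pca = new PCA()
        .setInputCol("features")
        .setOutputCol("pcaFeatures")
        .setK(20)
        .fit(data);

    Dataset<Row> reduced = pca.transform(data);

    // Each column of pc() is one component expressed as weights over the 112
    // original factors; large absolute weights point at the dominant originals.
    DenseMatrix loadings = pca.pc();
    System.out.println(loadings);

    // Cluster on the reduced representation (k = 5 is only a placeholder).
    KMeansModel model = new KMeans()
        .setFeaturesCol("pcaFeatures")
        .setK(5)
        .fit(reduced);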
Re: Machine learning question (using Spark) - removing redundant factors while doing clustering
Great question Rohit. I am in my early days of ML as well and it would be great if we get some idea on this from other experts on this group.

I know we can reduce dimensions by using PCA, but I think that does not allow us to understand which factors from the original we are using in the end.

- Tony L.

On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha wrote:
>
> I have a data-set where each data-point has 112 factors.
>
> I want to remove the factors which are not relevant, and say reduce to 20
> factors out of these 112, and then do clustering of data-points using these
> 20 factors.
>
> How do I do this, and how do I figure out which of the 20 factors are
> useful for analysis?
>
> I see SVD and PCA implementations, but I am not sure if these give which
> elements are removed and which are remaining.
>
> Can someone please help me understand what to do here.
>
> thanks,
> -Rohit
Re: Kmeans dataset initialization
Can anyone suggest how I can initialize the KMeans structure directly from a Dataset of Row?

On Sat, Aug 6, 2016 at 1:03 AM, Tony Lane <tonylane@gmail.com> wrote:
> I have all the data required for KMeans in a dataset in memory.
>
> The standard approach to load this data from a file is
> spark.read().format("libsvm").load(filename)
>
> where the file has data in the format
> 0 1:0.0 2:0.0 3:0.0
>
> How do I do this from an in-memory dataset that is already present?
> Any suggestions?
>
> -Tony
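One way to feed in-memory data to KMeans without going through a libsvm file is to build a Dataset<Row> with a Vector column named "features" (the default column KMeans reads). A sketch only, with placeholder values and column name, using the VectorUDT type from org.apache.spark.ml.linalg for the schema:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.ml.linalg.VectorUDT;
    import org.apache.spark.ml.linalg.Vectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    SparkSession spark = SparkSession.builder().appName("kmeans-in-memory").getOrCreate();

    // In-memory rows, each wrapping a feature vector (values are placeholders).
    List<Row> rows = Arrays.asList(
        RowFactory.create(Vectors.dense(0.0, 0.0, 0.0)),
        RowFactory.create(Vectors.dense(1.0, 1.0, 1.0)),
        RowFactory.create(Vectors.dense(9.0, 9.0, 9.0)));

    StructType schema = new StructType(new StructField[] {
        new StructField("features", new VectorUDT(), false, Metadata.empty())
    });

    Dataset<Row> data = spark.createDataFrame(rows, schema);

    // KMeans reads the "features" column by default, so no libsvm file is needed.
    KMeansModel model = new KMeans().setK(2).setSeed(1L).fit(data);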
Kmeans dataset initialization
I have all the data required for KMeans in a dataset in memory.

The standard approach to load this data from a file is

spark.read().format("libsvm").load(filename)

where the file has data in the format

0 1:0.0 2:0.0 3:0.0

How do I do this from an in-memory dataset that is already present?
Any suggestions?

-Tony
Re: Generating unique id for a column in Row without breaking into RDD and joining back
Mike, I have figured out how to do this. Thanks for the suggestion - it works great. I am now trying to figure out the performance impact of this.

Thanks again.

On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane <tonylane@gmail.com> wrote:
> @mike - this looks great. How can I do this in Java? What is the
> performance implication on a large dataset?
>
> @sonal - I can't have a collision in the values.
>
> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger <m...@flexiblecreations.com> wrote:
>
>> You can use the monotonically_increasing_id method to generate guaranteed
>> unique (but not necessarily consecutive) IDs. Calling something like:
>>
>> df.withColumn("id", monotonically_increasing_id())
>>
>> You don't mention which language you're using, but you'll need to pull in
>> the sql.functions library.
>>
>> Mike
>>
>> On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane@gmail.com> wrote:
>>
>> Ayan - basically I have a dataset with the structure below, where the bid
>> values are unique strings:
>>
>> bid: String
>> val: integer
>>
>> I need unique int values for these string bids to do some processing in
>> the dataset, like:
>>
>> id: int (unique integer id for each bid)
>> bid: String
>> val: integer
>>
>> -Tony
>>
>> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Can you explain a little further?
>>>
>>> best
>>> Ayan
>>>
>>> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane@gmail.com> wrote:
>>>
>>>> I have a row with a structure like
>>>>
>>>> identifier: String
>>>> value: int
>>>>
>>>> All identifiers are unique and I want to generate a unique long id for
>>>> the data and get a Row object back for further processing.
>>>>
>>>> I understand using the zipWithUniqueId function on RDD, but that would
>>>> mean first converting to RDD and then joining the RDD and dataset back.
>>>>
>>>> What is the best way to do this?
>>>>
>>>> -Tony
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
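For reference, the Java form of Mike's suggestion is essentially one line; df here stands for the existing Dataset<Row> with the bid and val columns. The generated IDs are unique 64-bit longs built from the partition ID and the record position within the partition, so they are not consecutive, and a re-run can assign different IDs if the partitioning changes:

    import static org.apache.spark.sql.functions.monotonically_increasing_id;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Adds a unique (but not consecutive) long id per row without converting to an RDD.
    Dataset<Row> withId = df.withColumn("id", monotonically_increasing_id());
    withId.show();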
Re: Generating unique id for a column in Row without breaking into RDD and joining back
@mike - this looks great. How can I do this in Java? What is the performance implication on a large dataset?

@sonal - I can't have a collision in the values.

On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger <m...@flexiblecreations.com> wrote:
> You can use the monotonically_increasing_id method to generate guaranteed
> unique (but not necessarily consecutive) IDs. Calling something like:
>
> df.withColumn("id", monotonically_increasing_id())
>
> You don't mention which language you're using, but you'll need to pull in
> the sql.functions library.
>
> Mike
>
> On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane@gmail.com> wrote:
>
> Ayan - basically I have a dataset with the structure below, where the bid
> values are unique strings:
>
> bid: String
> val: integer
>
> I need unique int values for these string bids to do some processing in
> the dataset, like:
>
> id: int (unique integer id for each bid)
> bid: String
> val: integer
>
> -Tony
>
> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> Can you explain a little further?
>>
>> best
>> Ayan
>>
>> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane@gmail.com> wrote:
>>
>>> I have a row with a structure like
>>>
>>> identifier: String
>>> value: int
>>>
>>> All identifiers are unique and I want to generate a unique long id for
>>> the data and get a Row object back for further processing.
>>>
>>> I understand using the zipWithUniqueId function on RDD, but that would
>>> mean first converting to RDD and then joining the RDD and dataset back.
>>>
>>> What is the best way to do this?
>>>
>>> -Tony
>>
>> --
>> Best Regards,
>> Ayan Guha
Re: Generating unique id for a column in Row without breaking into RDD and joining back
Ayan - basically I have a dataset with the structure below, where the bid values are unique strings:

bid: String
val: integer

I need unique int values for these string bids to do some processing in the dataset, like:

id: int (unique integer id for each bid)
bid: String
val: integer

-Tony

On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
> Hi
>
> Can you explain a little further?
>
> best
> Ayan
>
> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane@gmail.com> wrote:
>
>> I have a row with a structure like
>>
>> identifier: String
>> value: int
>>
>> All identifiers are unique and I want to generate a unique long id for the
>> data and get a Row object back for further processing.
>>
>> I understand using the zipWithUniqueId function on RDD, but that would
>> mean first converting to RDD and then joining the RDD and dataset back.
>>
>> What is the best way to do this?
>>
>> -Tony
>
> --
> Best Regards,
> Ayan Guha
Generating unique id for a column in Row without breaking into RDD and joining back
I have a row with a structure like

identifier: String
value: int

All identifiers are unique and I want to generate a unique long id for the data and get a Row object back for further processing.

I understand using the zipWithUniqueId function on RDD, but that would mean first converting to RDD and then joining the RDD and dataset back.

What is the best way to do this?

-Tony
Re: Using sparse vector leads to array out of bounds exception
I guess the setup of the model and the usage of the vector got to me. The setup takes positions 1, 2, 3 - like this in the build example - "1:0.0 2:0.0 3:0.0" - so I thought I needed to follow the same numbering while creating the vector too.

Thanks a bunch.

On Thu, Aug 4, 2016 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
> You mean "new int[] {0,1,2}" because vectors are 0-indexed.
>
> On Wed, Aug 3, 2016 at 11:52 AM, Tony Lane <tonylane@gmail.com> wrote:
> > Hi Sean,
> >
> > I did not understand. I created a KMeansModel with 3 dimensions and then
> > I am calling the predict method on this model with a 3-dimensional vector.
> > I am not sure what is wrong in this approach. Am I missing a point?
> >
> > Tony
> >
> > On Wed, Aug 3, 2016 at 11:22 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> You declare that the vector has 3 dimensions, but then refer to its
> >> 4th dimension (at index 3). That is the error.
> >>
> >> On Wed, Aug 3, 2016 at 10:43 AM, Tony Lane <tonylane@gmail.com> wrote:
> >> > I am using the following vector definition in Java
> >> >
> >> > Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1 }))
> >> >
> >> > However, when I run the predict method on this vector it leads to
> >> >
> >> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
> >> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:143)
> >> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:115)
> >> > at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:298)
> >> > at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:606)
> >> > at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:580)
> >> > at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:574)
> >> > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
> >> > at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:574)
> >> > at org.apache.spark.mllib.clustering.KMeansModel.predict(KMeansModel.scala:59)
> >> > at org.apache.spark.ml.clustering.KMeansModel.predict(KMeans.scala:130)
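In other words, the libsvm text format is 1-based while in-memory vectors are 0-based. A corrected sketch of the vector from this thread (the mllib Vectors class is shown; the ml.linalg Vectors class has the same sparse() signature):

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    // 3-dimensional sparse vector: valid indices are 0, 1, 2 (not 1, 2, 3).
    Vector v = Vectors.sparse(3, new int[] { 0, 1, 2 }, new double[] { 1.1, 1.1, 1.1 });
    // model.predict(v) should now work for a model trained on 3-dimensional data.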
Re: Using sparse vector leads to array out of bounds exception
Hi Sean,

I did not understand. I created a KMeansModel with 3 dimensions and then I am calling the predict method on this model with a 3-dimensional vector. I am not sure what is wrong in this approach. Am I missing a point?

Tony

On Wed, Aug 3, 2016 at 11:22 PM, Sean Owen <so...@cloudera.com> wrote:
> You declare that the vector has 3 dimensions, but then refer to its
> 4th dimension (at index 3). That is the error.
>
> On Wed, Aug 3, 2016 at 10:43 AM, Tony Lane <tonylane@gmail.com> wrote:
> > I am using the following vector definition in Java
> >
> > Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1 }))
> >
> > However, when I run the predict method on this vector it leads to
> >
> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:143)
> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:115)
> > at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:298)
> > at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:606)
> > at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:580)
> > at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:574)
> > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
> > at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:574)
> > at org.apache.spark.mllib.clustering.KMeansModel.predict(KMeansModel.scala:59)
> > at org.apache.spark.ml.clustering.KMeansModel.predict(KMeans.scala:130)
Using sparse vector leads to array out of bounds exception
I am using the following vector definition in Java

Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1 }))

However, when I run the predict method on this vector it leads to

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:143)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:115)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:298)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:606)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:580)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:574)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:574)
at org.apache.spark.mllib.clustering.KMeansModel.predict(KMeansModel.scala:59)
at org.apache.spark.ml.clustering.KMeansModel.predict(KMeans.scala:130)
Re: Stop Spark Streaming Jobs
SparkSession exposes a stop() method.

On Wed, Aug 3, 2016 at 8:53 AM, Pradeep wrote:
> Thanks Park. I am doing the same. Was trying to understand if there are
> other ways.
>
> Thanks,
> Pradeep
>
> > On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee wrote:
> >
> > So sorry. Your name was Pradeep !!
> >
> > -----Original Message-----
> > From: Park Kyeong Hee [mailto:kh1979.p...@samsung.com]
> > Sent: Wednesday, August 03, 2016 11:24 AM
> > To: 'Pradeep'; 'user@spark.apache.org'
> > Subject: RE: Stop Spark Streaming Jobs
> >
> > Hi, Pradeep
> >
> > Did you mean, how to kill the job?
> > If yes, you should kill the driver and follow the steps below.
> >
> > On yarn-client:
> > 1. find the pid - "ps -es | grep "
> > 2. kill it - "kill -9 "
> > 3. check the executors were down - "yarn application -list"
> >
> > On yarn-cluster:
> > 1. find the driver's application ID - "yarn application -list"
> > 2. stop it - "yarn application -kill "
> > 3. check the driver and executors were down - "yarn application -list"
> >
> > Thanks.
> >
> > -----Original Message-----
> > From: Pradeep [mailto:pradeep.mi...@mail.com]
> > Sent: Wednesday, August 03, 2016 10:48 AM
> > To: user@spark.apache.org
> > Subject: Stop Spark Streaming Jobs
> >
> > Hi All,
> >
> > My streaming job reads data from Kafka. The job is triggered and pushed to
> > the background with nohup.
> >
> > What are the recommended ways to stop the job, either in yarn-client or
> > yarn-cluster mode?
> >
> > Thanks,
> > Pradeep
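If the goal is to stop the streaming job from inside the driver rather than killing it through YARN, one common pattern (sketched here for a JavaStreamingContext named jssc, which is an assumption about how the job is set up) is a graceful stop so in-flight batches finish first:

    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Stop the streaming context and the underlying SparkContext, gracefully.
    jssc.stop(true, true);

Setting spark.streaming.stopGracefullyOnShutdown=true is another option, so that stopping the driver process still triggers a graceful shutdown of the streaming job.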
Error in building spark core on windows - any suggestions please
I am trying to build Spark on Windows and am getting the following test failures and a consequent build failure.

[INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ spark-core_2.11 ---

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.spark.api.java.OptionalSuite
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec - in org.apache.spark.api.java.OptionalSuite
Running org.apache.spark.JavaAPISuite
Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 25.792 sec <<< FAILURE! - in org.apache.spark.JavaAPISuite
wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.382 sec  <<< FAILURE!
java.lang.AssertionError: expected: but was:
        at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
Running org.apache.spark.JavaJdbcRDDSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.43 sec - in org.apache.spark.JavaJdbcRDDSuite
Running org.apache.spark.launcher.SparkLauncherSuite
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.047 sec <<< FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time elapsed: 0.032 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
        at org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:169)
Running org.apache.spark.memory.TaskMemoryManagerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec - in org.apache.spark.memory.TaskMemoryManagerSuite
Running org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec - in org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.132 sec - in org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.162 sec - in org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Running org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.597 sec - in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.117 sec - in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.697 sec - in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.853 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.624 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0

Results :

Failed tests:
  JavaAPISuite.wholeTextFiles:1089 expected: but was:
  SparkLauncherSuite.testChildProcLauncher:169 expected:<0> but was:<1>

Tests run: 195, Failures: 2, Errors: 0, Skipped: 0

[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 11.038 s]
[INFO] Spark Project Tags ... SUCCESS [ 11.611 s]
[INFO] Spark Project Sketch ... SUCCESS [ 27.037 s]
[INFO] Spark Project Networking ... SUCCESS [ 54.003 s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS [ 17.955 s]
[INFO] Spark Project Unsafe ... SUCCESS [ 21.667 s]
[INFO] Spark Project Launcher ... SUCCESS [ 17.632 s]
[INFO] Spark Project Core ... FAILURE [04:56 min]
[INFO] Spark Project GraphX ... SKIPPED
[INFO] Spark Project Streaming
error while running filter on dataframe
Can someone help me understand this error which occurs while running a filter on a dataframe?

2016-07-31 21:01:57 ERROR CodeGenerator:91 - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 117, Column 58: Expression "mapelements_isNull" is not an rvalue

/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ /** Codegened pipeline for:
/* 006 */ * TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#127L])
/* 007 */ +- Project
/* 008 */ +- Filter (is...
/* 009 */ */
/* 010 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 011 */   private Object[] references;
/* 012 */   private boolean agg_initAgg;
/* 013 */   private boolean agg_bufIsNull;
/* 014 */   private long agg_bufValue;
/* 015 */   private scala.collection.Iterator inputadapter_input;
/* 016 */   private Object[] deserializetoobject_values;
/* 017 */   private org.apache.spark.sql.types.StructType deserializetoobject_schema;
/* 018 */   private UnsafeRow deserializetoobject_result;
/* 019 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder deserializetoobject_holder;
/* 020 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter deserializetoobject_rowWriter;
/* 021 */   private UnsafeRow mapelements_result;
/* 022 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder mapelements_holder;
/* 023 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter mapelements_rowWriter;
/* 024 */   private Object[] serializefromobject_values;
/* 025 */   private UnsafeRow serializefromobject_result;
/* 026 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
/* 027 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
/* 028 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter1;
/* 029 */   private org.apache.spark.sql.execution.metric.SQLMetric filter_numOutputRows;
/* 030 */   private UnsafeRow filter_result;
/* 031 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder filter_holder;
/* 032 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter filter_rowWriter;
/* 033 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter filter_rowWriter1;
/* 034 */   private org.apache.spark.sql.execution.metric.SQLMetric agg_numOutputRows;
/* 035 */   private org.apache.spark.sql.execution.metric.SQLMetric agg_aggTime;
/* 036 */   private UnsafeRow agg_result;
/* 037 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
/* 038 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
/* 039 */
/* 040 */   public GeneratedIterator(Object[] references) {
/* 041 */     this.references = references;
/* 042 */   }
/* 043 */
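The dump of generated code suggests the compile failure happens in whole-stage code generation rather than in the filter expression itself. One diagnostic step (an assumption, not a confirmed fix for this particular error) is to disable whole-stage codegen and see whether the query then runs on the interpreted path:

    import org.apache.spark.sql.SparkSession;

    // Turn off whole-stage code generation to check whether codegen is the culprit.
    SparkSession spark = SparkSession.builder()
        .appName("filter-codegen-check")
        .config("spark.sql.codegen.wholeStage", "false")
        .getOrCreate();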
spark java - convert string to date
Is there any built-in function in Java with Spark to convert a string to a date more efficiently, or do we just use the standard Java techniques?

-Tony
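The built-in column functions avoid a round trip through Java objects, so they are usually the more efficient route. A sketch, assuming df has a string column named dateStr (the column name and the date pattern are placeholders):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.to_date;
    import static org.apache.spark.sql.functions.unix_timestamp;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // For ISO "yyyy-MM-dd" strings, to_date() is enough.
    Dataset<Row> withDate = df.withColumn("d", to_date(col("dateStr")));

    // For other patterns, parse to a timestamp first and then cast to a date.
    Dataset<Row> parsed = df.withColumn("d",
        unix_timestamp(col("dateStr"), "dd/MM/yyyy").cast("timestamp").cast("date"));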
Visualization of data analysed using spark
I am developing my analysis application using Spark (with Eclipse as the IDE). What is a good way to visualize the data, taking into consideration that I have multiple files which make up my Spark application? I have seen some notebook demos, but I am not sure how to use my application with such notebooks.

Thoughts / suggestions / experiences -- please share.

-Tony
Re: how to order data in descending order in spark dataset
Just to clarify, I am trying to do this in Java:

ts.groupBy("b").count().orderBy("count");

On Sun, Jul 31, 2016 at 12:00 AM, Tony Lane <tonylane@gmail.com> wrote:
> ts.groupBy("b").count().orderBy("count");
>
> How can I order this data in descending order of count?
> Any suggestions?
>
> -Tony
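In Java the descending sort can be expressed with the desc() method on a Column, for example:

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Sort the grouped counts from largest to smallest.
    Dataset<Row> byCountDesc = ts.groupBy("b").count().orderBy(col("count").desc());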
how to order data in descending order in spark dataset
ts.groupBy("b").count().orderBy("count"); how can I order this data in descending order of count Any suggestions -Tony
Spark 2.0 blocker on windows - spark-warehouse path issue
Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:C:/ibm/spark-warehouse

Does anybody know a solution to this?

cheers
tony
Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
I am facing the same issue and am completely blocked here. *Sean, can you please help with this issue?*

Migrating to 2.0.0 has really stalled our development effort.

-Tony

> ---------- Forwarded message ----------
> From: Sean Owen
> Date: Fri, Jul 29, 2016 at 12:47 AM
> Subject: Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
> To: Rohit Chaddha
> Cc: "user @spark"
>
> Ah, right. This wasn't actually resolved. Yeah, your input on 15899
> would be welcome. See if the proposed fix helps.
>
> On Thu, Jul 28, 2016 at 11:52 AM, Rohit Chaddha wrote:
> > Sean,
> >
> > I saw some JIRA tickets and it looks like this is still an open bug (rather
> > than an improvement, as marked in JIRA).
> >
> > https://issues.apache.org/jira/browse/SPARK-15893
> > https://issues.apache.org/jira/browse/SPARK-15899
> >
> > I am experimenting, but do you know of any solution off the top of your head?
> >
> > On Fri, Jul 29, 2016 at 12:06 AM, Rohit Chaddha <rohitchaddha1...@gmail.com> wrote:
> >>
> >> I am simply trying to do
> >> session.read().json("file:///C:/data/a.json");
> >>
> >> In 2.0.0-preview it was working fine with
> >> sqlContext.read().json("C:/data/a.json");
> >>
> >> -Rohit
> >>
> >> On Fri, Jul 29, 2016 at 12:03 AM, Sean Owen wrote:
> >>>
> >>> Hm, file:///C:/... doesn't work? That should certainly be an absolute
> >>> URI with an absolute path. What exactly is your input value for this
> >>> property?
> >>>
> >>> On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha wrote:
> >>> > Hello Sean,
> >>> >
> >>> > I have tried both file:/ and file:///
> >>> > But it does not work and gives the same error.
> >>> >
> >>> > -Rohit
> >>> >
> >>> > On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen wrote:
> >>> >>
> >>> >> IIRC that was fixed, in that this is actually an invalid URI. Use
> >>> >> file:/C:/... I think.
> >>> >>
> >>> >> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha wrote:
> >>> >> > I upgraded from 2.0.0-preview to 2.0.0
> >>> >> > and I started getting the following error
> >>> >> >
> >>> >> > Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> >>> >> > file:C:/ibm/spark-warehouse
> >>> >> >
> >>> >> > Any ideas how to fix this?
> >>> >> >
> >>> >> > -Rohit
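A workaround that has been reported for this Windows path issue (not a confirmed fix, and the directory used here is only an example) is to set spark.sql.warehouse.dir explicitly to a well-formed file: URI when building the session:

    import org.apache.spark.sql.SparkSession;

    // Point the warehouse directory at an explicit absolute file: URI so Spark
    // does not construct the malformed "file:C:/..." form on its own.
    SparkSession session = SparkSession.builder()
        .appName("windows-warehouse-workaround")
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
        .getOrCreate();

    session.read().json("file:///C:/data/a.json").show();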