Re: Prediction using Classification with text attributes in Apache Spark MLLib
Trying to improve the old solution. Do we have a better text classifier now in Spark MLlib? Regards, lmk -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Bad Digest error while doing aws s3 put
Hi Dhimant, As I had indicated in my next mail, my problem was due to the disk getting full with log messages (these were dumped onto the slaves) and had nothing to do with the content pushed into s3. So it looks like this error message is very generic and is thrown for various reasons. You may have to do some more research to find out the cause of your problem. Please keep me posted once you fix this issue. Sorry I could not be of more help. Regards
Re: SchemaRDD saveToCassandra
Hi Michael, Please correct me if I am wrong. The error seems to originate from spark only. Please have a look at the stack trace of the error, which is as follows:

[error] (run-main-0) java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row
java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row
        at com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
        at com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
        at scala.util.Try.getOrElse(Try.scala:77)
        at com.datastax.spark.connector.rdd.reader.AnyObjectFactory$.resolveConstructor(AnyObjectFactory.scala:133)
        at com.datastax.spark.connector.mapper.ReflectionColumnMapper.columnMap(ReflectionColumnMapper.scala:46)
        at com.datastax.spark.connector.writer.DefaultRowWriter.<init>(DefaultRowWriter.scala:20)
        at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:109)
        at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:107)
        at com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:171)
        at com.datastax.spark.connector.RDDFunctions.tableWriter(RDDFunctions.scala:76)
        at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:56)
        at ScalaSparkCassandra$.main(ScalaSparkCassandra.scala:66)
        at ScalaSparkCassandra.main(ScalaSparkCassandra.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)

Please clarify.
SchemaRDD saveToCassandra
Hi, My requirement is to extract certain fields from json files, run queries on them and save the result to cassandra. I was able to parse the json, filter the result and save the (regular) rdd to cassandra. Now, when I read the json file through sqlContext, execute some queries on it and then try to save the resulting SchemaRDD to cassandra using the saveToCassandra function, I get the following error: java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row. Can a spark SchemaRDD be directly saved to cassandra just like a regular rdd? If that is not possible, is there any way to convert the schema RDD to a regular RDD? Please advise. Regards, lmk
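A workaround sometimes suggested for this error is to map each catalyst Row back into a plain case class before calling saveToCassandra, so the connector can reflect over an ordinary constructor. A minimal sketch, assuming the Spark 1.0-era SchemaRDD API and the DataStax connector; the keyspace/table names, the query, and the Record class are illustrative:

```scala
import com.datastax.spark.connector._

// Illustrative case class matching the columns selected below.
case class Record(x1: String, x2: String)

val schemaRdd = sqlContext.sql("SELECT x1, x2 FROM logs")

// Convert each catalyst Row into a plain case class; the connector can
// resolve a constructor for Record, unlike for Row.
val plainRdd = schemaRdd.map(row =>
  Record(row(0).asInstanceOf[String], row(1).asInstanceOf[String]))

plainRdd.saveToCassandra("test", "test_data", Seq("x1", "x2"))
```

The same pattern generalizes to any column set: select the fields in the query, then rebuild a serializable case class per row before the save.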
NotSerializableException while doing rdd.saveToCassandra
Hi All, I am using spark-1.0.0 to parse a json file and save the values to cassandra using a case class. My code looks as follows:

case class LogLine(x1: Option[String], x2: Option[String], x3: Option[List[String]], x4: Option[String], x5: Option[String], x6: Option[String], x7: Option[String], x8: Option[String], x9: Option[String])

val data = test.map(line => {
  parse(line)
}).map(json => {
  // Extract the values
  implicit lazy val formats = org.json4s.DefaultFormats
  val x1 = (json \ "x1").extractOpt[String]
  val x2 = (json \ "x2").extractOpt[String]
  val x4 = (json \ "x4").extractOpt[String]
  val x5 = (json \ "x5").extractOpt[String]
  val x6 = (json \ "x6").extractOpt[String]
  val x7 = (json \ "x7").extractOpt[String]
  val x8 = (json \ "x8").extractOpt[String]
  val x3 = (json \ "x3").extractOpt[List[String]]
  val x9 = (json \ "x9").extractOpt[String]
  LogLine(x1, x2, x3, x4, x5, x6, x7, x8, x9)
})

data.saveToCassandra("test", "test_data", Seq("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"))

whereas the cassandra table schema is as follows:

CREATE TABLE test_data (
  x1 varchar,
  x2 varchar,
  x4 varchar,
  x5 varchar,
  x6 varchar,
  x7 varchar,
  x8 varchar,
  x3 list<text>,
  x9 varchar,
  PRIMARY KEY (x1));

I am getting the following error on executing the saveToCassandra statement:

14/08/27 11:33:59 INFO SparkContext: Starting job: runJob at package.scala:169
14/08/27 11:33:59 INFO DAGScheduler: Got job 5 (runJob at package.scala:169) with 1 output partitions (allowLocal=false)
14/08/27 11:33:59 INFO DAGScheduler: Final stage: Stage 5(runJob at package.scala:169)
14/08/27 11:33:59 INFO DAGScheduler: Parents of final stage: List()
14/08/27 11:33:59 INFO DAGScheduler: Missing parents: List()
14/08/27 11:33:59 INFO DAGScheduler: Submitting Stage 5 (MappedRDD[7] at map at <console>:45), which has no missing parents
14/08/27 11:33:59 INFO DAGScheduler: Failed to run runJob at package.scala:169
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkConf

Here the data field is org.apache.spark.rdd.RDD[LogLine] = MappedRDD[7] at map at <console>:45. How can I make this serializable, or is this a different problem? Please advise. Regards, lmk
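A "Task not serializable: ... SparkConf" failure usually means the closure passed to map is dragging in something non-serializable from its enclosing scope (often the object holding the SparkContext/SparkConf), not that the RDD elements themselves are unserializable. A hedged sketch of the usual restructuring, reusing the parsing logic above; the LogParser name is illustrative:

```scala
// Keep the parsing logic in a standalone serializable object so the
// closure shipped to executors captures only this object, not the
// enclosing class that holds the SparkContext/SparkConf.
object LogParser extends Serializable {
  import org.json4s._
  import org.json4s.jackson.JsonMethods._

  def parseLine(line: String): LogLine = {
    implicit val formats = DefaultFormats
    val json = parse(line)
    LogLine(
      (json \ "x1").extractOpt[String],
      (json \ "x2").extractOpt[String],
      (json \ "x3").extractOpt[List[String]],
      (json \ "x4").extractOpt[String],
      (json \ "x5").extractOpt[String],
      (json \ "x6").extractOpt[String],
      (json \ "x7").extractOpt[String],
      (json \ "x8").extractOpt[String],
      (json \ "x9").extractOpt[String])
  }
}

val data = test.map(LogParser.parseLine) // closure references only LogParser
```

If the error persists, running the job with -Dsun.io.serialization.extendedDebugInfo=true can help identify exactly which captured object fails to serialize.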
Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra
Hi Yana, I have done take and confirmed the existence of data. I also checked that it is getting connected to Cassandra. That is why I suspect that this particular rdd is not serializable. Thanks, lmk

On Aug 28, 2014 5:13 AM, Yana [via Apache Spark User List] wrote: I'm not so sure that your error is coming from the cassandra write. You have val data = test.map(..).map(..), so data will actually not get created until you try to save it. Can you try to do something like data.count() or data.take(k) after this line and see if you even get to the cassandra part? My suspicion is that you're trying to access something (SparkConf?) within the map closures...
Re: Bad Digest error while doing aws s3 put
This was a completely misleading error message. The problem was due to a log message getting dumped to stdout. This accumulated on the workers, and hence there was no space left on device after some time. When I re-tested with spark-0.9.1, the saveAsTextFile api threw a "no space left on device" error after writing the same 48 files. On checking the master, all was ok, but on the slaves the stdout contributed 99% of the root filesystem. On removing the particular log, it is now working fine in both versions.
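For anyone hitting the same symptom, a quick way to confirm that executor stdout is filling the disk is to check the root filesystem and the Spark work directories on each slave. A hedged sketch; the work-directory path shown is a common default and may differ on your cluster:

```shell
# Overall disk usage on this node; look for a nearly full root filesystem.
df -h /

# Per-application usage under Spark's work directory, where each
# executor's stdout/stderr land (adjust the path if SPARK_WORKER_DIR
# or spark.local.dir points elsewhere).
du -sh /root/spark/work/* 2>/dev/null | sort -h | tail
```

Running this across the slaves before and after a failing job makes it obvious whether the job's own logging is what exhausts the disk.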
Re: Bad Digest error while doing aws s3 put
Is it possible that the Content-MD5 changes during a multipart upload to s3? But even then, it succeeds if I increase the cluster configuration. For example:
- it throws a Bad Digest error after writing 48/100 files when the cluster has 3 m3.2xlarge slaves
- it throws a Bad Digest error after writing 64/100 files when the cluster has 4 m3.2xlarge slaves
- it throws a Bad Digest error after writing 86/100 files when the cluster has 5 m3.2xlarge slaves
- it succeeds in writing all 100 files when the cluster has 6 m3.2xlarge slaves.
Please clarify. Regards, lmk
Re: Bad Digest error while doing aws s3 put
Thanks Patrick. But why am I getting a Bad Digest error when I am saving a large amount of data to s3?

Loss was due to org.apache.hadoop.fs.s3.S3Exception
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/spark_test%2Fsmaato_one_day_phase_2%2Fsmaato_2014_05_17%2F_temporary%2F_attempt_201408041624__m_65_165%2Fpart-00065' XML Error Message:
<?xml version="1.0" encoding="UTF-8"?><Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>lb2tDEVSSnRNM4pw6504Bg==</ExpectedDigest><CalculatedDigest>EL9UDBzFvTwJycA7Ii2KGA==</CalculatedDigest><RequestId>437F15C89D355081</RequestId><HostId>kJQI+c9edzBmT2Z9sbfAELYT/8R5ezLWeUgeIU37iPsq5KQm/qAXItunZY35wnYx</HostId></Error>
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:82)

As indicated earlier, I use the following command as an alternative to saveAsTextFile:

x.map(x => (NullWritable.get(), new Text(x.toString))).coalesce(100).saveAsHadoopFile[TextOutputFormat[NullWritable, Text]]("s3n://dest-dir/")

In the above case, it succeeds until it writes some 48 part files out of 100 (though this 48 is also inconsistent) and then starts throwing the above error. The same works well if I increase the capacity of the cluster (say from 3 m3.2xlarge slaves to 6), or reduce the data size. Is there a possibility that the data is getting corrupted when the load increases? Please advise. I have been stuck with this problem for the past couple of weeks. Thanks, lmk
Re: Bad Digest error while doing aws s3 put
Hi, I was using saveAsTextFile earlier, and it was working fine. When we migrated to spark-1.0, I started getting the following error:

java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
java.net.URLClassLoader$1.run(URLClassLoader.java:366)
java.net.URLClassLoader$1.run(URLClassLoader.java:355)

Hence I changed my code as follows:

x.map(x => (NullWritable.get(), new Text(x.toString))).saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)

After this I am facing this problem when I write very large data to s3. It also occurs while writing to only some partitions: say, while writing to 240 partitions, it might succeed for 156 files, then start throwing the Bad Digest error and hang. Please advise. Regards, lmk
save to HDFS
Hi, I have a scala application which I have launched onto a spark cluster. I have the following statement trying to save to a folder on the master:

saveAsHadoopFile[TextOutputFormat[NullWritable, Text]]("hdfs://masteripaddress:9000/root/test-app/test1/")

The application executes successfully and the log says that the save is complete also, but I am not able to find the saved file anywhere. Is there a way I can access this file? Pls advise. Regards, lmk
Re: save to HDFS
Hi Akhil, I am sure that the RDD that I saved is not empty; I have tested it using take. But is there no way that I can see this saved file physically like we do in the normal context? Can't I view this folder as I am already logged into the cluster? And should I run hadoop fs -ls hdfs://masteripaddress:9000/root/test-app/test1/ after I log in to the cluster? Thanks, lmk
Re: save to HDFS
Thanks Akhil. I was able to view the files. Actually I was trying to list them using regular ls, and since it did not show anything I was concerned. Thanks for showing me the right direction. Regards, lmk
Bad Digest error while doing aws s3 put
Hi, I am getting the following error while trying to save a large dataset to s3 using the saveAsHadoopFile command with apache spark-1.0:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/spark_test%2Fsmaato_one_day_phase_2%2Fsmaato_2014_05_17%2F_temporary%2F_attempt_201407170658__m_36_276%2Fpart-00036' XML Error Message:
<?xml version="1.0" encoding="UTF-8"?><Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><ExpectedDigest>N808DtNfYiTFzI+i2HxLEw==</ExpectedDigest><CalculatedDigest>66nS+2C1QqQmmcTeFpXOjw==</CalculatedDigest><RequestId>4FB3A3D60B187CE7</RequestId><HostId>H2NznP+RvwspekVHBMvgWGYAupKuO5YceSgmiLym6rOajOh5v5GnyM0VkO+dadyG</HostId></Error>

I have used the same command to write similar content with less data to s3 without any problem. When I googled this error message, it is said to be possibly due to an md5 checksum mismatch. But could this happen due to load? Regards, lmk
Prediction using Classification with text attributes in Apache Spark MLLib
Hi, I am trying to predict an attribute with a binary value (Yes/No) using SVM. All the attributes in my training set are text attributes. I understand that I have to encode the outcome as a double (0.0/1.0), but I do not understand how to deal with my explanatory variables, which are also text. Please let me know how I can do this. Thanks.
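One common way to turn categorical text attributes into the numeric vectors MLlib's SVM expects is the hashing trick: hash each string into a fixed-size index space and set that slot to 1.0. A minimal, Spark-free sketch of the idea; the dimension and sample attribute values are illustrative, and in MLlib the resulting array would be wrapped in a LabeledPoint together with the 0.0/1.0 outcome:

```scala
// Hash one categorical value into a slot of a fixed-size feature space.
def hashIndex(value: String, dim: Int): Int =
  ((value.hashCode % dim) + dim) % dim // force a non-negative index

// One-hot-via-hashing: encode a row of text attributes as a dense vector.
def encode(attrs: Seq[String], dim: Int): Array[Double] = {
  val v = Array.fill(dim)(0.0)
  attrs.foreach(a => v(hashIndex(a, dim)) = 1.0)
  v
}

// Hypothetical row of text attributes hashed into 1000 dimensions.
val vec = encode(Seq("chrome", "mobile", "DE"), 1000)
```

With a large enough dimension, hash collisions are rare, and this avoids keeping a global dictionary of all attribute values, which matters at your data volume.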
RE: Prediction using Classification with text attributes in Apache Spark MLLib
Hi Alexander, Thanks for your prompt response. Earlier I was running this prediction using Weka only, but now we are moving to a huge dataset and hence to Apache Spark MLlib. Is there any other way to convert to libSVM format? Or is there any simpler algorithm that I can use in MLlib? Thanks, lmk
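For reference, the libSVM text format is just "label index:value ..." with 1-based, ascending feature indices and zero entries omitted, so once the features are numeric it is straightforward to emit directly. A small sketch; the feature encoding itself is assumed to have happened already:

```scala
// Format one example as a libSVM line: "label i1:v1 i2:v2 ..." with
// 1-based feature indices in ascending order, zeros omitted.
def toLibSvmLine(label: Double, features: Seq[Double]): String = {
  val nonZero = features.zipWithIndex.collect {
    case (v, i) if v != 0.0 => s"${i + 1}:$v"
  }
  (label +: nonZero).mkString(" ")
}

val line = toLibSvmLine(1.0, Seq(0.0, 2.5, 0.0, 1.0)) // "1.0 2:2.5 4:1.0"
```

Mapping such a formatter over an RDD of examples and calling saveAsTextFile would produce a file loadable by libSVM-format readers.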
Re: Can this be done in map-reduce technique (in parallel)
Hi Cheng, Sorry again. In this method, I see that the values for a <- positions.iterator and b <- positions.iterator always remain the same. I tried to do b <- positions.iterator.next, but it throws an error: value filter is not a member of (Double, Double). Is there something I am missing here? Thanks
Re: Can this be done in map-reduce technique (in parallel)
Hi Cheng, Thanks a lot. That solved my problem. Thanks again for the quick response and solution.
Can this be done in map-reduce technique (in parallel)
Hi, I am a new spark user. Pls let me know how to handle the following scenario:

I have a data set with the following fields:
1. DeviceId
2. latitude
3. longitude
4. ip address
5. Datetime
6. Mobile application name

With the above data, I would like to perform the following steps:
1. Collect all lat and lon for each ipaddress: (ip1,(lat1,lon1),(lat2,lon2)), (ip2,(lat3,lon3),(lat4,lon4))
2. For each IP:
   1. Find the distance between each lat/lon coordinate pair and all the other pairs under the same IP
   2. Select those coordinates whose distances fall under a specific threshold (say 100m)
   3. Find the coordinate pair with the maximum occurrences

In this case, how can I iterate and compare each coordinate pair with all the other pairs? Can this be done in a distributed manner, as this data set is going to have a few million records? Can we do this with map/reduce commands? Thanks.
Re: Can this be done in map-reduce technique (in parallel)
Hi Oleg/Andrew, Thanks much for the prompt response. We expect thousands of lat/lon pairs for each IP address, and that is my concern with the Cartesian product approach. Currently, for a small sample of this data (5000 rows), I am grouping by IP address and then computing the distance between lat/lon coordinates using array manipulation techniques. But I understand this approach is not right when the data volume goes up. My code is as follows:

val dataset: RDD[String] = sc.textFile("x.csv")
val data = dataset.map(l => l.split(","))
val grpData = data.map(r => (r(3), (r(1).toDouble, r(2).toDouble))).groupByKey()

Now I have the data grouped by ipaddress as Array[(String, Iterable[(Double, Double)])], e.g. Array((ip1, ArrayBuffer((lat1,lon1), (lat2,lon2), (lat3,lon3)))). Now I have to find the distance between (lat1,lon1) and (lat2,lon2), then between (lat1,lon1) and (lat3,lon3), and so on for all combinations. This is where I get stuck. Please guide me on this. Thanks again.
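The per-IP pairwise step can be done without a full RDD cartesian product by enumerating combinations inside each group. A Spark-free sketch of the core logic, using the haversine formula for great-circle distance; the threshold value is illustrative, and inside Spark this would run in a mapValues over the grouped RDD so only same-IP pairs are ever compared:

```scala
// Great-circle distance between two (lat, lon) points in meters (haversine).
def haversineMeters(a: (Double, Double), b: (Double, Double)): Double = {
  val R = 6371000.0 // mean Earth radius in meters
  val dLat = math.toRadians(b._1 - a._1)
  val dLon = math.toRadians(b._2 - a._2)
  val h = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(a._1)) * math.cos(math.toRadians(b._1)) *
      math.pow(math.sin(dLon / 2), 2)
  2 * R * math.asin(math.sqrt(h))
}

// All unordered pairs within one IP's coordinate list that lie within
// the threshold; combinations(2) enumerates each pair exactly once.
def closePairs(points: Seq[(Double, Double)],
               thresholdM: Double): List[((Double, Double), (Double, Double))] =
  points.combinations(2).collect {
    case Seq(p, q) if haversineMeters(p, q) <= thresholdM => (p, q)
  }.toList
```

Note this is still quadratic within each IP group, so for IPs with thousands of points a spatial bucketing step (e.g. grouping by a rounded lat/lon grid cell first) may be needed before the pairwise pass.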