Re: Question about SparkSQL and Hive-on-Spark
Hi Will,

We are planning to start implementing these functions. We hope to have a general design ready in the following week.

Best Regards,

Yi Tian
tianyi.asiai...@gmail.com

On Sep 23, 2014, at 23:39, Will Benton wrote:

> Hi Yi,
>
> I've had some interest in implementing windowing and rollup in particular for
> some of my applications but haven't had them on the front of my plate yet.
> If you need them as well, I'm happy to start taking a look this week.
>
> best,
> wb

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: A couple questions about shared variables
Filed https://issues.apache.org/jira/browse/SPARK-3642 for documenting these nuances.

-Sandy

On Mon, Sep 22, 2014 at 10:36 AM, Nan Zhu wrote:

> I see, thanks for pointing this out
>
> --
> Nan Zhu
>
> On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote:
>
> MapReduce counters do not count duplications. In MapReduce, if a task
> needs to be re-run, the value of the counter from the second task
> overwrites the value from the first task.
>
> -Sandy
>
> On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu wrote:
>
> If you think it is necessary to fix, I would like to resubmit that PR
> (it seems to have some conflicts with the current DAGScheduler).
>
> My suggestion is to make it an option in the accumulator, e.g. some
> algorithms that use an accumulator for result calculation need a
> deterministic accumulator, while others implementing something like Hadoop
> counters may need the current implementation (count everything that
> happened, including the duplications).
>
> Your thoughts?
>
> --
> Nan Zhu
>
> On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:
>
> Hmm, good point, this seems to have been broken by refactorings of the
> scheduler, but it worked in the past. Basically the solution is simple --
> in a result stage, we should not apply the update for each task ID more
> than once -- the same way we don't call job.listener.taskSucceeded more
> than once. Your PR also tried to avoid this for resubmitted shuffle stages,
> but I don't think we need to do that necessarily (though we could).
>
> Matei
>
> On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com)
> wrote:
>
> Hi, Matei,
>
> Can you give some hint on how the current implementation guarantees that
> the accumulator update is only applied once?
>
> There is a pending PR trying to achieve this
> (https://github.com/apache/spark/pull/228/files), but from the current
> implementation, I didn’t see that this has been done (maybe I missed
> something).
>
> Best,
>
> --
> Nan Zhu
>
> On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
>
> Hey Sandy,
>
> On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com)
> wrote:
>
> Hey All,
>
> A couple questions came up about shared variables recently, and I wanted
> to confirm my understanding and update the doc to be a little more clear.
>
> *Broadcast variables*
> Now that task data is automatically broadcast, the only occasions where
> it makes sense to explicitly broadcast are:
> * You want to use a variable from tasks in multiple stages.
> * You want to have the variable stored on the executors in deserialized
>   form.
> * You want tasks to be able to modify the variable and have those
>   modifications take effect for other tasks running on the same executor
>   (usually a very bad idea).
>
> Is that right?
>
> Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also
> matters. (We might later factor tasks in a different way to avoid 2, but
> it's hard due to things like Hadoop JobConf objects in the tasks).
>
> *Accumulators*
> Values are only counted for successful tasks. Is that right? KMeans seems
> to use it in this way. What happens if a node goes away and successful
> tasks need to be resubmitted? Or the stage runs again because a different
> job needed it.
>
> Accumulators are guaranteed to give a deterministic result if you only
> increment them in actions. For each result stage, the accumulator's update
> from each task is only applied once, even if that task runs multiple times.
> If you use accumulators in transformations (i.e. in a stage that may be
> part of multiple jobs), then you may see multiple updates, from each run.
> This is kind of confusing but it was useful for people who wanted to use
> these for debugging.
>
> Matei
>
> thanks,
> Sandy
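
The difference between the two accumulator behaviours discussed in this thread can be sketched in plain Scala. This is only a simplified model and not the DAGScheduler code: `TaskAttempt`, `applyOncePerTask`, and `applyEveryRun` are hypothetical names introduced for illustration. It assumes each completed task attempt reports a task ID and an update, and it contrasts "apply the update for each task ID at most once" (the result-stage behaviour Matei describes) with "count every run" (what updates made inside transformations may see).

```scala
object AccumulatorSketch {
  // One completed task attempt: which task it ran and the accumulator
  // update it reported. Hypothetical type, not a Spark class.
  final case class TaskAttempt(taskId: Int, update: Long)

  // Result-stage behaviour as described in the thread: only the first
  // completed attempt seen for each task ID contributes to the total.
  def applyOncePerTask(attempts: Seq[TaskAttempt]): Long = {
    val (total, _) = attempts.foldLeft((0L, Set.empty[Int])) {
      case ((sum, seen), attempt) =>
        if (seen.contains(attempt.taskId)) (sum, seen)
        else (sum + attempt.update, seen + attempt.taskId)
    }
    total
  }

  // Behaviour for updates made inside transformations: every run of a
  // task contributes, so re-run tasks are counted again.
  def applyEveryRun(attempts: Seq[TaskAttempt]): Long =
    attempts.map(_.update).sum
}
```

For example, if task 0 reports an update of 5, task 1 reports 7, and task 0 is then re-run and reports 5 again, `applyOncePerTask` yields 12 while `applyEveryRun` yields 17.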
SPARK-3660 : Initial RDD for updateStateByKey transformation
Hello fellow developers,

Thanks, TD, for the relevant pointers.

I have created an issue:
https://issues.apache.org/jira/browse/SPARK-3660

Copying the description from JIRA:

"
How to initialize the state transformation updateStateByKey?

I have word counts from a previous spark-submit run, and want to load them in the next spark-submit job to continue counting from there.

One proposal is to add the following argument to the updateStateByKey methods:

initial : Option [RDD [(K, S)]] = None

This will maintain backward compatibility as well.

I have working code as well.

This thread started on the spark-user list at:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-updateStateByKey-operation-td14772.html
"

Please let me know whether I should add a parameter "initial : Option [RDD [(K, S)]] = None" to all updateStateByKey methods or create new ones.

Thanks,
-Soumitra.
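
To make the proposal concrete, here is a minimal plain-Scala sketch of the idea, assuming state is modelled as an in-memory `Map` rather than an RDD. It is not the patch attached to the JIRA: `updateBatch` and `run` are hypothetical helpers, and only the shape of the update function, `(Seq[V], Option[S]) => Option[S]`, follows what updateStateByKey expects. The optional `initial` argument defaults to `None`, which is how the proposal keeps existing callers working.

```scala
object InitialStateSketch {
  // Same shape as the update function passed to updateStateByKey:
  // the new values for a key in this batch, plus the previous state if any.
  type UpdateFunc[V, S] = (Seq[V], Option[S]) => Option[S]

  // Hypothetical helper (not a Spark API): applies one batch to the state.
  def updateBatch[K, V, S](
      state: Map[K, S],
      batch: Seq[(K, V)],
      updateFunc: UpdateFunc[V, S]): Map[K, S] = {
    val grouped: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    // Keys whose update function returns None are dropped from the state.
    keys.flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }

  // Hypothetical helper: folds over all batches, starting from the
  // optional initial state (empty when None, as in the current behaviour).
  def run[K, V, S](
      batches: Seq[Seq[(K, V)]],
      updateFunc: UpdateFunc[V, S],
      initial: Option[Map[K, S]] = None): Map[K, S] =
    batches.foldLeft(initial.getOrElse(Map.empty[K, S])) { (state, batch) =>
      updateBatch(state, batch, updateFunc)
    }

  // Running word count: add this batch's values to the previous count.
  val countWords: UpdateFunc[Int, Int] =
    (values, previous) => Some(previous.getOrElse(0) + values.sum)
}
```

With batches `Seq(Seq("a" -> 1, "b" -> 1), Seq("a" -> 1))`, calling `run` without `initial` gives `Map("a" -> 2, "b" -> 1)`, while passing `Some(Map("a" -> 10, "c" -> 5))` continues from the earlier counts and gives `Map("a" -> 12, "b" -> 1, "c" -> 5)`.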
Re: Question about SparkSQL and Hive-on-Spark
Hi Will,

We're also very interested in windowing support in SparkSQL. Let us know once this is available for testing. Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Tue, Sep 23, 2014 at 8:39 AM, Will Benton wrote:

> Hi Yi,
>
> I've had some interest in implementing windowing and rollup in particular for
> some of my applications but haven't had them on the front of my plate yet.
> If you need them as well, I'm happy to start taking a look this week.
>
> best,
> wb
spark.local.dir and spark.worker.dir not used
Hi,

I am using Spark 1.0.0. In my Spark code I am trying to persist an RDD to disk with rdd.persist(DISK_ONLY), but I couldn't find the location where the RDD has been written to disk. I set SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to a location other than the default /tmp directory, but still couldn't see anything in the worker directory or the Spark local directory.

I also tried specifying the local dir and worker dir from the Spark code while defining the SparkConf, as conf.set("spark.local.dir", "/home/padma/sparkdir"), but the directories are not used.

In general, which directories does Spark use for map output files, intermediate writes, and persisting RDDs to disk?

Thanks,
Padma Ch
Re: OutOfMemoryError on parquet SnappyDecompressor
This may be related: https://github.com/Parquet/parquet-mr/issues/211

Perhaps if we change our configuration settings for Parquet it would get better, but the performance characteristics of Snappy are pretty bad here under some circumstances.

On Tue, Sep 23, 2014 at 10:13 AM, Cody Koeninger wrote:

> Cool, that's pretty much what I was thinking as far as configuration goes.
>
> Running on Mesos. Worker nodes are amazon xlarge, so 4 core / 15g. I've
> tried executor memory sizes as high as 6G
> Default hdfs block size 64m, about 25G of total data written by a job with
> 128 partitions. The exception comes when trying to read the data (all
> columns).
>
> Schema looks like this:
>
> case class A(
>   a: Long,
>   b: Long,
>   c: Byte,
>   d: Option[Long],
>   e: Option[Long],
>   f: Option[Long],
>   g: Option[Long],
>   h: Option[Int],
>   i: Long,
>   j: Option[Int],
>   k: Seq[Int],
>   l: Seq[Int],
>   m: Seq[Int]
> )
>
> We're just going back to gzip for now, but might be nice to help someone
> else avoid running into this.
Re: OutOfMemoryError on parquet SnappyDecompressor
Cool, that's pretty much what I was thinking as far as configuration goes.

Running on Mesos. Worker nodes are amazon xlarge, so 4 core / 15g. I've tried executor memory sizes as high as 6G
Default hdfs block size 64m, about 25G of total data written by a job with 128 partitions. The exception comes when trying to read the data (all columns).

Schema looks like this:

case class A(
  a: Long,
  b: Long,
  c: Byte,
  d: Option[Long],
  e: Option[Long],
  f: Option[Long],
  g: Option[Long],
  h: Option[Int],
  i: Long,
  j: Option[Int],
  k: Seq[Int],
  l: Seq[Int],
  m: Seq[Int]
)

We're just going back to gzip for now, but might be nice to help someone else avoid running into this.

On Tue, Sep 23, 2014 at 11:18 AM, Michael Armbrust wrote:

> I actually submitted a patch to do this yesterday:
> https://github.com/apache/spark/pull/2493
>
> Can you tell us more about your configuration. In particular how much
> memory/cores do the executors have and what does the schema of your data
> look like?
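
For readers who want to go back to gzip as mentioned in this thread, one way is to set the codec on the SQLContext before writing Parquet files. This is a sketch under assumptions: it assumes a Spark build that already includes the codec setting introduced by the commit discussed here (key `spark.sql.parquet.compression.codec`), and the application name is made up for illustration. It only affects how new files are written and has not been checked against the out-of-memory case reported above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetCodecSketch {
  def main(args: Array[String]): Unit = {
    // The app name is a placeholder; the master is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("parquet-codec-sketch")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Write Parquet files with gzip instead of the snappy default.
    // As noted in the thread, this is set on the SQL conf rather than
    // picked up from the Spark context's configuration.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

    sc.stop()
  }
}
```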
Re: OutOfMemoryError on parquet SnappyDecompressor
I actually submitted a patch to do this yesterday:
https://github.com/apache/spark/pull/2493

Can you tell us more about your configuration. In particular how much memory/cores do the executors have and what does the schema of your data look like?

On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger wrote:

> So as a related question, is there any reason the settings in SQLConf
> aren't read from the spark context's conf? I understand why the sql conf
> is mutable, but it's not particularly user friendly to have most spark
> configuration set via e.g. defaults.conf or --properties-file, but for
> spark sql to ignore those.
Re: RFC: Deprecating YARN-alpha API's
Any other comments or objections on this?

Thanks,
Tom

On Tuesday, September 9, 2014 4:39 PM, Chester Chen wrote:

We were using it until recently; we are talking to our customers to see if we can get off it.

Chester
Alpine Data Labs

On Tue, Sep 9, 2014 at 10:59 AM, Sean Owen wrote:

> FWIW consensus from Cloudera folk seems to be that there's no need or
> demand on this end for YARN alpha. It wouldn't have an impact if it
> were removed sooner even.
>
> It will be a small positive to reduce complexity by removing this
> support, making it a little easier to develop for current YARN APIs.
>
> On Tue, Sep 9, 2014 at 5:16 PM, Patrick Wendell wrote:
> > Hi Everyone,
> >
> > This is a call to the community for comments on SPARK-3445 [1]. In a
> > nutshell, we are trying to figure out timelines for deprecation of the
> > YARN-alpha API's as Yahoo is now moving off of them. It's helpful for
> > us to have a sense of whether anyone else uses these.
> >
> > Please comment on the JIRA if you have feedback, thanks!
> >
> > [1] https://issues.apache.org/jira/browse/SPARK-3445
> >
> > - Patrick
Re: Question about SparkSQL and Hive-on-Spark
Hi Yi,

I've had some interest in implementing windowing and rollup in particular for some of my applications but haven't had them on the front of my plate yet. If you need them as well, I'm happy to start taking a look this week.

best,
wb

- Original Message -
> From: "Yi Tian"
> To: dev@spark.apache.org
> Sent: Tuesday, September 23, 2014 2:47:17 AM
> Subject: Question about SparkSQL and Hive-on-Spark
>
> Hi all,
>
> I have some questions about the SparkSQL and Hive-on-Spark
>
> Will SparkSQL support all the hive feature in the future? or just making hive
> as a datasource of Spark?
>
> From Spark 1.1.0 , we have thrift-server support running hql on spark. Will
> this feature be replaced by Hive on Spark?
>
> The reason for asking these questions is that we found some hive functions
> are not running well on SparkSQL ( like window function, cube and rollup
> function)
>
> Is it worth for making effort on implement these functions with SparkSQL?
> Could you guys give some advices ?
>
> thank you.
>
> Best Regards,
>
> Yi Tian
> tianyi.asiai...@gmail.com
Re: OutOfMemoryError on parquet SnappyDecompressor
So as a related question, is there any reason the settings in SQLConf aren't read from the spark context's conf? I understand why the sql conf is mutable, but it's not particularly user friendly to have most spark configuration set via e.g. defaults.conf or --properties-file, but for spark sql to ignore those.

On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger wrote:

> After commit 8856c3d8 switched from gzip to snappy as default parquet
> compression codec, I'm seeing the following when trying to read parquet
> files saved using the new default (same schema and roughly same size as
> files that were previously working):
>
> java.lang.OutOfMemoryError: Direct buffer memory
>     java.nio.Bits.reserveMemory(Bits.java:658)
>     java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>     java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>     parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
>     parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
>     java.io.DataInputStream.readFully(DataInputStream.java:195)
>     java.io.DataInputStream.readFully(DataInputStream.java:169)
>     parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
>     parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
>     parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
>     parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
>     parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339)
>     parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>     parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>     parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:265)
>     parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
>     parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
>     parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
>     parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
>     parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
>     org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
>     org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>     scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>     scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
>     scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
>     org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
>     org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
>     org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>     org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>     org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>     org.apache.spark.scheduler.Task.run(Task.scala:54)
>     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     java.lang.Thread.run(Thread.java:722)
RE: spark.local.dir and spark.worker.dir not used
Hi,

spark.local.dir is the setting used for writing map output data and persisted RDD blocks. The file paths are hashed, so you cannot directly find the persisted RDD block files, but they will definitely be in these folders on your worker node.

Thanks
Jerry

From: Priya Ch [mailto:learnings.chitt...@gmail.com]
Sent: Tuesday, September 23, 2014 6:31 PM
To: u...@spark.apache.org; dev@spark.apache.org
Subject: spark.local.dir and spark.worker.dir not used

Hi,

I am using Spark 1.0.0. In my Spark code I am trying to persist an RDD to disk with rdd.persist(DISK_ONLY), but I couldn't find the location where the RDD has been written to disk. I set SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to a location other than the default /tmp directory, but still couldn't see anything in the worker directory or the Spark local directory.

I also tried specifying the local dir and worker dir from the Spark code while defining the SparkConf, as conf.set("spark.local.dir", "/home/padma/sparkdir"), but the directories are not used.

In general, which directories does Spark use for map output files, intermediate writes, and persisting RDDs to disk?

Thanks,
Padma Ch
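
As a rough illustration of why the block files are hard to spot, the following plain-Scala sketch shows one way a file name can be hashed into a root directory and a sub-directory. It is modelled on my understanding of how Spark's disk block manager lays out files and may differ between versions: `blockFile` and `nonNegativeHash` are hypothetical helpers, and the block name `rdd_0_0` and the count of 64 sub-directories used in the usage note are assumptions. In a real deployment Spark may also create its own uniquely named folder under each configured root, which this sketch leaves out.

```scala
import java.io.File

object LocalDirSketch {
  // Non-negative hash of the file name (guards against Int.MinValue,
  // whose absolute value does not fit in an Int).
  def nonNegativeHash(name: String): Int = {
    val h = name.hashCode
    if (h == Int.MinValue) 0 else math.abs(h)
  }

  // Hypothetical helper: picks one of the configured root directories and
  // a two-hex-digit sub-directory from the hash of the block file name.
  def blockFile(localDirs: Seq[File], subDirsPerLocalDir: Int, name: String): File = {
    val hash = nonNegativeHash(name)
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    new File(new File(localDirs(dirId), "%02x".format(subDirId)), name)
  }
}
```

For example, `LocalDirSketch.blockFile(Seq(new File("/home/padma/sparkdir")), 64, "rdd_0_0")` returns a path of the form `/home/padma/sparkdir/<two hex digits>/rdd_0_0`, so a recursive search for the block name under the configured directory on each worker is usually easier than browsing by hand.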
resources allocated for an application
Hi,

I am trying to find out where exactly in the Spark code resources get allocated for a newly submitted Spark application. I have a standalone Spark cluster. Can someone please direct me to the right part of the code?

regards
Re: Question about SparkSQL and Hive-on-Spark
On Tue, Sep 23, 2014 at 12:47 AM, Yi Tian wrote:

> Hi all,
>
> I have some questions about the SparkSQL and Hive-on-Spark
>
> Will SparkSQL support all the hive feature in the future? or just making
> hive as a datasource of Spark?

Most likely not *ALL* Hive features, but almost all common features.

> From Spark 1.1.0 , we have thrift-server support running hql on spark.
> Will this feature be replaced by Hive on Spark?

No.

> The reason for asking these questions is that we found some hive functions
> are not running well on SparkSQL ( like window function, cube and rollup
> function)
> Is it worth for making effort on implement these functions with SparkSQL?
> Could you guys give some advices ?

Yes absolutely.

> thank you.
>
> Best Regards,
>
> Yi Tian
> tianyi.asiai...@gmail.com
Question about SparkSQL and Hive-on-Spark
Hi all,

I have some questions about SparkSQL and Hive-on-Spark.

Will SparkSQL support all the Hive features in the future, or will it just use Hive as a data source for Spark?

Since Spark 1.1.0, we have had thrift-server support for running HQL on Spark. Will this feature be replaced by Hive on Spark?

The reason for asking these questions is that we found some Hive functions do not run well on SparkSQL (like window functions, cube, and rollup).

Is it worth making the effort to implement these functions in SparkSQL? Could you give some advice?

Thank you.

Best Regards,

Yi Tian
tianyi.asiai...@gmail.com