Bucketing
Hi all, I am trying to use bucketing and realize it is not allowed on a DataFrame write. Is there any workaround, or any update on when this functionality will be made available in Spark? Thanks
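(For context, the only workaround I have found so far is writing to a metastore-backed table via saveAsTable, since path-based save() rejects bucketBy. A minimal sketch in Scala, with a hypothetical DataFrame df, column user_id, and table name:

    // bucketBy is only honored for persistent (metastore-backed) tables,
    // so use saveAsTable instead of save()/parquet() with an explicit path.
    df.write
      .bucketBy(8, "user_id")    // 8 buckets on user_id
      .sortBy("user_id")         // optional: sort within each bucket
      .format("parquet")
      .saveAsTable("events_bucketed")

Is that still the expected approach, or is bucketing on plain save() planned?)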
Re: writing to local files on a worker
I have been looking at Spark-Blast, which calls BLAST - a well-known C++ program - in parallel. In my case I have tried to translate the C++ code to Java but am not getting the same results; it is convoluted. I have code that will call the program and read its results. The only real issue is that the program wants local files, and its file access is convoluted, with many seeks, so replacing it with streaming will not work. As long as my Java code can write to a local file for the duration of one call, things can work. I considered in-memory files, as long as they can be passed to another program, and I am willing to have OS-specific code.

So my issue is: I need to write three files, run a program, and read one output file; then all the files can be deleted. JNI calls will be hard - this is a program, not a library - and it is available on the worker nodes.

On Sun, Nov 11, 2018 at 10:52 PM Jörn Franke wrote:
> Can you use JNI to call the C++ functionality directly from Java?
>
> Or you could wrap this in a MapReduce step outside Spark and use Hadoop Streaming (it allows you to use shell scripts as mapper and reducer).
>
> You can also write temporary files for each partition and execute the software within a map step.
>
> Generally you should not call external applications from Spark.
>
> On 11.11.2018 at 23:13, Steve Lewis wrote:
> > I have a problem where a critical step needs to be performed by a third-party C++ application. I can send or install this program on the worker nodes, and I can construct a function holding all the data this program needs to process. The problem is that the program is designed to read and write from the local file system. I can call the program from Java and read its output as a local file, then delete all temporary files, but I doubt it is possible to get the program to read from HDFS or any shared file system.
> >
> > My question is: can a function running on a worker node create temporary files and pass the names of these to a local process, assuming everything is cleaned up after the call?

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
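To make the "temp files per partition" idea concrete, here is a minimal sketch in Scala of what I have in mind - the binary path, file names, and the RDD[String] input are hypothetical placeholders, and error handling is kept to a minimum:

    import java.nio.file.Files
    import scala.sys.process._

    // For each partition: write the records to a node-local temp file, run
    // the external program against it, read its output, then clean up.
    val results = rdd.mapPartitions { records =>
      val dir = Files.createTempDirectory("blast-")   // node-local scratch dir
      val input = dir.resolve("query.in")
      Files.write(input, records.mkString("\n").getBytes("UTF-8"))
      val output = dir.resolve("hits.out")
      // Blocks until the process exits; "/opt/blast/bin/blastn" is hypothetical.
      val exit = Seq("/opt/blast/bin/blastn",
                     "-query", input.toString,
                     "-out", output.toString).!
      require(exit == 0, s"external program failed with exit code $exit")
      val lines = scala.io.Source.fromFile(output.toFile).getLines().toList
      // Delete the scratch files before returning this partition's results.
      dir.toFile.listFiles().foreach(_.delete())
      dir.toFile.delete()
      lines.iterator
    }

Since the temp directory lives on the worker's local disk only for the duration of a single call, this matches the file lifetime I need.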
question about barrier execution mode in Spark 2.4.0
Hello, I was reading the Spark 2.4.0 release docs and I'd like to find out more about barrier execution mode. In particular, I'd like to know what happens when the number of partitions exceeds the number of nodes (which I think is allowed; the Spark tuning doc mentions it). Does Spark guarantee that all tasks process all partitions simultaneously? If not, how does barrier mode handle partitions that are waiting to be processed? If there are partitions waiting to be processed, then I don't think it's possible to send all the data from a given stage to a DL process, even when using barrier mode? Thanks a lot, Joe
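For reference, the 2.4 barrier API looks like the sketch below (Scala). From the API docs, my understanding - please correct me if I've misread them - is that a barrier stage only launches when all of its tasks can be scheduled together, so the scheduler needs at least as many free slots as there are tasks and otherwise fails the job rather than processing partitions in waves, which is really what I'm asking about:

    import org.apache.spark.BarrierTaskContext

    // All tasks of a barrier stage are scheduled together; each task can
    // block at ctx.barrier() until every other task reaches the same point.
    val synced = rdd.barrier().mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      ctx.barrier()   // global synchronization point across the stage
      // ... hand this partition's data to the DL framework here ...
      iter
    }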
Re: Questions on Python support with Spark
I've never tried to run a stand-alone cluster alongside Hadoop, but why not run Spark as a YARN application? That way it can absolutely (in fact, preferably) use the distributed file system. On Fri, Nov 9, 2018 at 5:04 PM, Arijit Tarafdar wrote:
> Hello All,
>
> We have a requirement to run PySpark in standalone cluster mode and also to reference Python libraries (egg/wheel) which are not local but placed in distributed storage like HDFS. From the code it looks like neither case is supported.
>
> Questions are:
> 1. Why is PySpark supported only in standalone client mode?
> 2. Why does --py-files only support local files and not files stored in remote stores?
>
> We would like to update the Spark code to support these scenarios, but we want to be aware of any technical difficulties the community has faced while trying to support them.
>
> Thanks, Arijit
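A minimal sketch of that approach, assuming your YARN client config is already in place - the application and dependency paths here are hypothetical placeholders:

    # On YARN, remote URIs are distributed via the distributed cache, so
    # --py-files can point at HDFS rather than the local file system.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --py-files hdfs:///libs/deps.zip \
      app.py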
Re: FW: Spark2 and Hive metastore
In order for Spark to see the Hive metastore you need to build the SparkSession accordingly:

    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("myApp")
      .config("hive.metastore.uris", "thrift://localhost:9083")
      .enableHiveSupport()
      .getOrCreate()

On Mon, Nov 12, 2018 at 11:49 AM Ирина Шершукова wrote:
> Hello guys, spark2.1.0 couldn't connect to the existing Hive metastore.
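Once the session is up, a quick sanity check against the metastore (the database and table names here are hypothetical):

    // List metastore databases and inspect a table to confirm connectivity.
    spark.sql("SHOW DATABASES").show()
    spark.table("mydb.mytable").printSchema()

Alternatively, placing the cluster's hive-site.xml in $SPARK_HOME/conf should have the same effect as setting hive.metastore.uris in code.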
Re: [Spark-Core] Long scheduling delays (1+ hour)
Forgot to add the link: https://jira.apache.org/jira/browse/KAFKA-5649